Most graphics rendering algorithms used in both animated feature films and real-time games
can enjoy the performance and quality boost that comes with temporally reusing previous
computation. However, there is a lack of proper rendering benchmarks that would allow
detailed and objective comparisons between different temporal methods. Currently, very slowly
moving cameras and unsuitable scenes and animations are used, which results in an uneven
playing field for comparisons, with an obvious bias towards the proposed novel methods.

In this thesis, we describe a framework that can be used to capture 3D animations from
interactive scenarios and compile them into a dataset that is suitable as a dynamic benchmark.
The capturing framework is used in the creation of two datasets: EternalValleyVR and
EternalValleyFPS. We verify the quality of these datasets and the dynamic challenge they pose
to the algorithms. By surveying the input features used in state of the art temporal reuse
algorithms, we form metrics for the change in features that happens throughout the animation.
The proposed dynamic benchmarks are shown to surpass the previously released animations in
temporal complexity.
Most rendering algorithms used in films and games can be sped up and improved in quality by
reusing previously computed information. However, there are no proper benchmarks for these
temporal algorithms. At present, rendering algorithms are tested with slow cameras, one-off
animations, and test scenes. This leads to a clearly unfair setting for the compared
algorithms, because the scenes may have been designed to show the authors' own algorithms in
a good light.

In this Master of Science thesis, a capturing system is created that can record dynamic 3D
animations suitable for temporal benchmarks. The system is demonstrated by creating two
benchmark datasets, EternalValleyVR and EternalValleyFPS. The temporal quality of the created
animations needs to be verified. In practice, this is done by studying modern temporal
algorithms and by forming suitable comparison metrics based on the parameters they use. The
metrics are used to compare the two created datasets with commonly used animations. Based on
the comparison, EternalValleyVR and EternalValleyFPS are considerably harder to render than
previously published animations.


Keywords: Graphics rendering, Temporal reuse, Path tracing, Ray tracing,
Dataset, Benchmark


The originality of this publication has been checked with the Turnitin OriginalityCheck service.


PREFACE 


This project has received funding from the ECSEL Joint Undertaking (JU) under grant 
agreement No 783162 (FitOptiVis). The JU receives support from the European Union’s 
Horizon 2020 research and innovation programme and Netherlands, Czech Republic, 
Finland, Spain, Italy. 


Work was done at Tampere University while working as a research assistant in the
university's Virtual Reality and Graphics group. First and foremost, I would like to thank
Prof. Pekka Jääskeläinen for the opportunity to finish my thesis and the trust to work on this
subject, and my supervisor Dr. Markku Makitalo, for the guidance and feedback.


I would like to thank Julius for suggesting the initial idea and inspiration for this thesis.
Thanks to the rest of the graphics group for the fruitful and informative brainstorming.
Indeed, the work would be lesser if not for those conversations.


In addition, I'd like to thank my family and friends, who have, throughout my years of
study, given me tireless love and support. Especially I'd like to thank my grandfather Risto
for the compassionate encouragement in my studies. Finally, I want to thank Riina for
relentlessly cheering me up while writing the thesis.


Tampere, 26th April 2021 


Joel Alanko 


CONTENTS 
1 Introduction
2 Graphics Rendering
    2.1 Light Transport
    2.2 Animations
3 Temporal Rendering
    3.1 Reuse Algorithms
    3.2 Benchmarking Temporal Rendering
        3.2.1 Benchmark Requirements
        3.2.2 Dataset Comparison Metrics
        3.2.3 Animation Capturing Methods
4 Related Work
    4.1 Rendering Performance Benchmarks
    4.2 Rendering Benchmarks
        4.2.1 The Utah 3D Animation Repository
        4.2.2 NVidia ORCA
5 Capturing dataset from Cube 2: Sauerbraten
    5.1 High Level Dataset Description in glTF 2.0
        5.1.1 Comparison with Other 3D Animation Formats
        5.1.2 Used glTF 2.0 Features and Extensions
    5.2 Sauerbraten Rendering Loop
    5.3 [title illegible]
        5.3.1 Offline Start Up Captures
        5.3.2 Runtime Captures
    5.4 [title illegible]
    5.5 Capturing Virtual Reality
6 [title illegible]
    6.1 Dataset [remainder of title illegible]
    6.2 Camera Movement
    6.3 [title illegible]
7 [title illegible]
References
Appendix A Appendix
    A.1 Pseudocode of linear blend skinning of vertices in Vulkan vertex shader
    A.2 Datasets' camera rotation animation in Euler angles
    A.3 Datasets' camera rotation animation distance

LIST OF SYMBOLS AND ABBREVIATIONS 


B               Local bone transformation matrix
L_e             Power of radiance emitted from ray hit surface
L_in            Incoming radiance to a point
L_out           Radiance leaving a point to a direction
M               Bind pose matrix
R               Rotation matrix used in transformations
S               Scale matrix used in transformations
T               Translation vector used in transformations
Ω               Hemisphere over a ray hit point
Qdiscard        Threshold value for the discard function
Q               Temporal accumulation blending factor
                Angle of pitch, yaw or roll rotation
φ               Angle of rotation around a vector
(+)             Reprojection operator
θ               Angle between surface normal and incoming light direction
c               Camera position vector
d_p             Change in translation
d_r             Change in rotation angle
d               Dual quaternion
Fr              Function to accumulate pixel's color history
f_r             Bidirectional reflectance distribution function
f_discard       Discard function for invalid pixels
f_percentage    Function to retrieve the percentage of image pixels discarded
p, n, u, v, c   Vector-valued variables
q               Unit quaternion
t               Time parameter
x, y, z, w, h   Single value variables


3D              Three dimensional
AI              Artificial intelligence
API             Application programming interface
APSNR           Average peak signal-to-noise ratio over an animation
ASCII           American standard code for information interchange, a character encoding
BART            A benchmark for animated ray tracing
BRDF            Bidirectional reflectance distribution function
CAD             Computer aided design
CBR             Checkerboard rendering method
CC-0 licence    Creative commons no rights reserved licence
CPU             Central processing unit
DAE             Collada, a 3D file format by Khronos Group
DLSS            Deep learning super sampling
FBX             Filmbox, a proprietary file format by Autodesk
GGX             A microfacet distribution
glTF 2.0        GL Transmission Format version 2.0 specification
GPU             Graphics processing unit
HMD             Head mounted display
HUD             Heads up display
IP              Intellectual property
[abbreviation missing]   3D Model benchmark
[abbreviation missing]   3D file format by id Software
MPI             Max Planck Institute
MSAA            Multisample anti-aliasing
MTL             File format for material definition
NVidia RTX      NVidia ray tracing hardware
OBJ             File format for geometry definition by Wavefront Technologies
ORCA            NVidia open research content archive
PBR             Physically based rendering, a modern material model
PBRT            Physically based research renderer
PC              Personal computer
PSNR            Peak signal-to-noise ratio
RGB             Additive color model with red, green and blue lights
spp             Samples per pixel
TAA             Temporal anti-aliasing
TRS             Affine transformation matrix composed of translation, rotation, and scale
UNC             University of North Carolina
USD             Universal scene description, a 3D file format from Pixar
UT AnimRep      The Utah 3D Animation Repository
VR              Virtual reality


1 INTRODUCTION 


Graphics rendering refers to methods for synthesizing an image for a display from a virtual
three-dimensional mathematical model. Graphics rendering is commonplace in entertainment
industries like animated feature films and games. It is also often utilized in design phases
with computer aided design (CAD) across almost all other industries, including the industrial,
product, and architecture industries.


A pleasant and interactive experience requires the display to show a new image at a high
frequency. However, rendering a realistic image takes time. In dynamic rendering, the next
frame is often very coherent with the previous one. This coherency can be utilized so that
computational effort is not wasted. Methods that do so are called temporal reuse methods:
the previously rendered image is used in some way to help render the following image.


Often when the performance of methods and processes is compared, benchmarks are 
created and used. Benchmarks contain reproducible test scenarios that are used as an 
input for algorithms. The results can then be compared with the confidence that the test 
was performed in an appropriate setting. 


For temporal reuse algorithms, a benchmarking setting would be a dataset that contains the
3D data and animations required in the image rendering. With benchmarks, it would be
easier to compare the advancements in algorithm development, having access to previously
understood and used dynamic datasets. Moreover, a benchmark would clarify the
field of temporal rendering, showing how and where the state of the art algorithms succeed
and fail in rendering good quality animations. It would also serve as a challenge to
motivate pushing forward rendering development.


Dynamic datasets could also be helpful in machine learning and deep neural network research,
which has recently gained much interest. The data-driven area of deep learning needs a large
amount of data so that the networks can learn as much as possible from different inputs. In
graphics rendering, the transformation information of a quickly moving camera could, for
example, be used to predict what the rendering should focus on. Also, the scenery
deformations could determine what parts contribute to the final image. With such information,
sophisticated occlusion culling or acceleration structure updates could be automated.


However, very few such animations have been publicly released, and graphics research
rarely uses them. There are a few obvious reasons for this. First, gathering and creating
these datasets takes time and effort, and polishing them to release quality takes even
more [1]. Second, there is no single clear animation format to select, because plenty of
standard file formats are used across the industry. Third, the animations are either created
by the authors themselves, based on an IP the authors own, or bought, in which case they
cannot be released to the public. Papers use animations to produce convincing results, but
they rarely use publicly available datasets or release the animations they used. The fact
that animations are only present in the research papers' results serves as a bias towards the
apparent novelties researchers are proposing, as it is impossible to reproduce the exact same
case.


For comparison, static single-image rendering algorithms do have standardized benchmarking
datasets. For example, the Sponza scene, with its simple geometry and reasonably complicated
material models, is commonly used with real-time rendering algorithms, and the San Miguel
scene is used with offline path tracing rendering algorithms [2]. A few datasets have
animations, but they lack the most complicated scenarios that are commonplace in practice.
Temporal rendering complexity comes with a quickly moving camera and fast-paced animations.
Currently, these datasets only have slowly flying cameras interpolating from point to point,
which often does not match the actual use case [3]. In interactive scenarios, cameras shake
and move irregularly.


In virtual reality (VR), the view may shake even more than in PC or mobile applications,
and VR is also more sensitive to issues caused by poor rendering quality [4]. The VR research
community lacks such datasets, and so do other head mounted display (HMD) and screen
rendering research areas, such as light-field and augmented reality (AR) rendering.


The signal processing research community uses a similar subset of temporal feature
buffers in motion flow algorithms [5] [6] [7]. Motion estimation can be used, for example,
in automated car driving tasks. The datasets used there lack the 3D world information
required by temporal reuse algorithms, but better dynamic datasets would also help that
community generate new datasets for motion flow.


This thesis focuses on the following research questions: 


* What are the fundamental features currently used in temporal reuse algorithms?
* What makes a dataset temporally challenging enough to be used in benchmarking?
* How can the temporal challenge be measured and compared between different
  animations?
* What are the existing dynamic benchmarks?
* How can the dataset be captured from an interactive scenario, like a game, and
  what is stored in the dataset?


The primary goal of this thesis is to produce a framework that can be utilized to create
dynamic datasets. We recognize that games already have complex animations accompanied by
varying lighting settings that are hard to render in real time, which makes them an excellent
dataset capture platform. To understand what makes an animation temporally challenging to
render, we review the graphics and 3D animation background for both real-time rasterization
and path tracing and then familiarize ourselves with the modern temporal reuse algorithms.
We analyze the inputs they often use and present metrics that can be used to compare
datasets with each other. Finally, we use the proposed framework in the creation of
two datasets and verify that the produced datasets have similar complexity in materials
and geometry to earlier datasets but pose a greater challenge for temporal algorithms.


The structure of the thesis is as follows. In Chapter 2, we start by establishing the
necessary 3D graphics rendering theory. Then, in Chapter 3, we introduce the state
of the art advancements in temporal reuse algorithms. We also present benchmarking
metrics for comparing the feature buffers of datasets that we recognize as being utilized
in temporal rendering algorithms. After that, in Chapter 4, we review previously released
rendering benchmarks and dynamic datasets. Next, we present the capturing framework
in Chapter 5 and use it to capture and create two benchmarking datasets from an
open-source game called Cube 2: Sauerbraten. Summarized results are presented in
Chapter 6. Finally, in the concluding Chapter 7, we discuss the work done and the
outcome of the thesis.


2 GRAPHICS RENDERING 


This chapter describes the necessary theory and the standard methods of computer graphics
for creating virtual worlds and displaying them on a computer screen. We start by stating
the rendering equation, a simple concept that is nevertheless effective in practice: it gives
realism to 3D scenes by tying together computer graphics and actual physical lighting
phenomena. The equation itself is approachable, but evaluating it has been the central issue
in graphics rendering since it was introduced. Then we describe how surfaces react to
lighting using surface material definitions and their underlying microfacet structures. After
that, we show how geometry is built out of simple mathematical concepts and primitives.
Then, we show the standard practice for describing virtual cameras and lights in computer
graphics. Finally, we introduce rigidbody and armature animations and describe practical
representations for moving and rotating objects in virtual worlds.


2.1 Light Transport 


To synthesize an image to a screen, a standard way is to have a virtual camera that looks
at a 3D scene [8]. There are two popular approaches to this. First, rasterization uses
graphics processing units (GPU) to preprocess 3D world geometry primitives and finally
calculate colors for each pixel in a massively parallel manner. Rasterization is the standard
for games, which use GPU application programming interfaces (API), like OpenGL, Vulkan
and Direct3D 12, to access the functions and memory of the GPU [9] [11]. The
second common approach is Monte Carlo path tracing, which we will describe shortly. In
both methods, conceptually, each pixel is colored according to a ray shot from a virtual
camera hitting a point in the scene, using the underlying material of the hit surface. Like
in real life, the lighting condition matters. The apparent color of, for example, a red apple
is different when viewed outdoors and when observed in a dark cellar.


The rendering equation is an integral equation for all the radiance leaving a point in a
given direction. Radiance measures the light energy flowing through a point in a given
direction, per unit area and per unit solid angle. The equation was introduced
simultaneously by Kajiya and Immel et al. [13]. Simply, the equation is a sum of emission
and reflected radiance:


$$
L_{out} = L_e + \int_{\Omega} L_{in} \, f_r \cos\theta \, d\omega, \tag{2.1}
$$


Figure 2.1. Physically based materials applied on top of Blender Suzanne model. Vertex 
encoded texture coordinates are used to sample the four PBR textures. From the left, the 
textures are base color, normal map, metallic, and roughness. 


where $L_{out}$ is the radiance towards the view direction, $L_e$ is the amount of emission,
that is, the power emitted by the hit surface, $\Omega$ means that the integration is
performed over the hemisphere around the point through all the incoming directions $\omega$,
$L_{in}$ is the incoming radiance, $f_r$ denotes the bidirectional reflectance distribution
function (BRDF), and $\theta$ is the angle between the surface normal and the incoming light
direction.
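
As an illustration of how Eq. (2.1) can be evaluated numerically, the following Python sketch estimates the integral with Monte Carlo sampling. It is a minimal example under assumptions that are not part of the thesis: the surface is Lambertian ($f_r$ = albedo / π), the normal points along +z, directions are sampled uniformly over the hemisphere, and the incoming radiance is supplied as a hypothetical callback `incoming_radiance`.

```python
import math
import random


def sample_hemisphere_uniform(rng):
    """Uniformly sample a direction on the unit hemisphere around +z.

    Returns the direction and its probability density 1 / (2 * pi).
    """
    z = rng.random()                      # cos(theta), uniform in [0, 1)
    azimuth = 2.0 * math.pi * rng.random()
    r = math.sqrt(max(0.0, 1.0 - z * z))
    direction = (r * math.cos(azimuth), r * math.sin(azimuth), z)
    return direction, 1.0 / (2.0 * math.pi)


def estimate_outgoing_radiance(emitted, albedo, incoming_radiance, samples, rng):
    """Monte Carlo estimate of L_out = L_e + integral of L_in * f_r * cos(theta)."""
    brdf = albedo / math.pi               # Lambertian BRDF (assumption of this sketch)
    total = 0.0
    for _ in range(samples):
        direction, pdf = sample_hemisphere_uniform(rng)
        cos_theta = direction[2]          # the surface normal is +z
        total += incoming_radiance(direction) * brdf * cos_theta / pdf
    return emitted + total / samples


rng = random.Random(1)
l_out = estimate_outgoing_radiance(
    emitted=0.0,
    albedo=0.8,
    incoming_radiance=lambda direction: 1.0,  # constant white environment
    samples=20000,
    rng=rng,
)
print(l_out)
```

For a constant incoming radiance of one, the integral has the closed-form value equal to the albedo, so the printed estimate should settle close to 0.8 as the sample count grows. A path tracer applies the same estimator recursively, because the incoming radiance at one point is the outgoing radiance of another.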


3D image synthesis is primarily about solving this equation for each pixel. Commonly in
modern rasterization, the equation's integral is simplified to a sum of radiance
contributions. The contribution is formed from all of the light sources in the scene shading
the surface, multiplied by the underlying color [14]. In Monte Carlo path tracing, hundreds
or thousands of paths are shot from the camera through each pixel of the imagined screen
into the scene, and when a ray hits a surface, it is reflected in a random direction. Each
reflected segment of the camera path contributes slightly to the final pixel's color, and
when this is repeated thousands of times with randomly selected directions, the result has
more realistic lighting than simple rasterization. The apparent realism comes from a lighting
effect called indirect lighting, sometimes called global illumination. It is a subtle effect
that comes from light bouncing off and reflecting from all surfaces. Solving or approximating
it has been a research interest for a long time.


Incoming radiance $L_{in}$ for a camera $c$ and a given view ray $v$ can be written as


$$
L_{in}(c, -v) = L_{out}(p, v),
$$


where $L_{out}$ is the outgoing radiance from the surface intersection point $p$ that the
view ray has hit. We assume there are no participating media, like smoke, and we are
only interested in solid surfaces that do not let light pass through or refract it. Here,
radiance can be emitted directly by the surface itself, as shown in Eq. (2.1).
More commonly, surfaces do not emit radiance themselves but reflect light
emitted elsewhere. A given point reflects incoming radiance back in the opposite direction.
Locally, these surface reflectance phenomena and their subsurface scattering are described
by the BRDF $f_r(l, v)$, where $l$ is the incoming light direction and $v$ the outgoing view
direction [15].


When any surface is examined at the microscopic and atomic level, different types of
irregularities can be observed. The realistic simulation of the light scattering
distributions of surface materials has received decades of research attention [16] [17].
More recently, a physically based rendering (PBR) microfacet material model called
Trowbridge-Reitz / GGX has been rediscovered and widely adopted [19]. The GGX model includes
a geometry term, a Fresnel term, and a normal distribution term, which combine approximations
of different physical microfacet phenomena. The geometry term approximates the shadowing and
masking that happens at the microsurface level. The normal distribution term
approximates the distribution of the surface normals, and the Fresnel term approximates
how reflected light depends on the viewing angle. In addition to the viewing and light
directions, these BRDF terms require only three parameters: how metallic the object
is, how rough the surface is, and what the index of refraction (IOR) of the object is [19].
With these, almost every solid object can be synthesized in virtual worlds.
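
The three GGX terms can be sketched in a few lines of Python. The sketch below is illustrative only and rests on common choices that the text does not prescribe: the roughness is remapped as alpha = roughness², the Fresnel term uses Schlick's approximation, and the geometry term uses the separable Smith form. All function names are hypothetical.

```python
import math


def ggx_normal_distribution(n_dot_h, roughness):
    """Trowbridge-Reitz / GGX normal distribution term D."""
    alpha = roughness * roughness         # common remapping (assumption)
    a2 = alpha * alpha
    denom = n_dot_h * n_dot_h * (a2 - 1.0) + 1.0
    return a2 / (math.pi * denom * denom)


def fresnel_schlick(v_dot_h, f0):
    """Schlick's approximation of the Fresnel term F."""
    return f0 + (1.0 - f0) * (1.0 - v_dot_h) ** 5


def f0_from_ior(ior):
    """Reflectance at normal incidence for a dielectric in air."""
    return ((ior - 1.0) / (ior + 1.0)) ** 2


def smith_g1(n_dot_x, roughness):
    """Smith masking term for one direction (view or light)."""
    alpha = roughness * roughness
    a2 = alpha * alpha
    return 2.0 * n_dot_x / (n_dot_x + math.sqrt(a2 + (1.0 - a2) * n_dot_x * n_dot_x))


def geometry_smith(n_dot_v, n_dot_l, roughness):
    """Separable Smith geometry term G combining shadowing and masking."""
    return smith_g1(n_dot_v, roughness) * smith_g1(n_dot_l, roughness)


def specular_brdf(n_dot_v, n_dot_l, n_dot_h, v_dot_h, roughness, f0):
    """Microfacet specular BRDF: D * F * G / (4 (n.v) (n.l))."""
    d = ggx_normal_distribution(n_dot_h, roughness)
    f = fresnel_schlick(v_dot_h, f0)
    g = geometry_smith(n_dot_v, n_dot_l, roughness)
    return d * f * g / (4.0 * n_dot_v * n_dot_l)


# Example: glass-like dielectric (IOR 1.5) viewed and lit head-on.
f0 = f0_from_ior(1.5)
value = specular_brdf(1.0, 1.0, 1.0, 1.0, roughness=0.5, f0=f0)
print(f0, value)
```

The dot products are passed in directly so that the sketch stays independent of any particular vector library; a renderer would compute them from the normal, view, light, and half vectors at the shaded point.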


Materials are the basic format for passing surface specifications between different 3D
software packages. They often consist of textures, which are images that contain pixel-wise
parameters for each of the reflectance terms, and other parameters, like whether the surface
should be transparent or emit light. Textures map pixel-specific roughness and metalness
values, but the Fresnel term is usually a constant for the whole material [19]. The popular
metallic-roughness workflow is not a fully physically accurate model [20], but it is easy to
work with. This workflow gives artists a convenient mental model for authoring GGX materials:
the metallic texture selects whether the material has metallic parts, and the roughness
texture defines how rough or smooth the surface is. An example of how the GGX workflow
textures are mapped on a model is depicted in Figure 2.1. Using texture coordinates, which
are simply 2D image coordinates, we can sample the textures for any part of a surface
triangle and retrieve the GGX settings used there.


Figure 2.2. A vertex, a triangle, and a normal perpendicular to the surface annotated on
the Blender Suzanne model.


The geometry of a virtual 3D scene can be represented by connecting three vertices, which
are 3D points in space, to form a triangle as shown in Figure 2.2. A ray can be imagined as
being shot out of the camera towards the triangle and intersecting it. Color can then be
fetched from the intersection point. Moreover, a blended color can be formed from the colors
encoded at the vertices by weighting each vertex color according to the position of the
intersection point relative to that vertex. If such methods are followed further, any complex
scene can be formed by combining thousands or millions of triangles. This is the basis of all
3D graphics [21]. In addition to simple triangles, there are quads, often used in facial
animation, and other curved surfaces expressed in mathematical form. They are commonly
described as faces, primitive patches of the surface. In this work, we focus mainly on
triangles.
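
The weighting of vertex colors described above is commonly done with barycentric coordinates. The Python sketch below illustrates this for a triangle in 2D, for example after projection to the screen; the function names and the example triangle are assumptions made for illustration only.

```python
def barycentric_weights(p, a, b, c):
    """Barycentric weights of the 2D point p with respect to triangle (a, b, c)."""
    def edge(u, v, w):
        # Twice the signed area of the triangle (u, v, w).
        return (v[0] - u[0]) * (w[1] - u[1]) - (v[1] - u[1]) * (w[0] - u[0])

    area = edge(a, b, c)
    weight_a = edge(b, c, p) / area       # sub-triangle opposite to vertex a
    weight_b = edge(c, a, p) / area       # sub-triangle opposite to vertex b
    weight_c = edge(a, b, p) / area       # sub-triangle opposite to vertex c
    return weight_a, weight_b, weight_c


def interpolate_color(p, vertices, colors):
    """Blend the three vertex colors at point p inside the triangle."""
    weights = barycentric_weights(p, *vertices)
    return tuple(
        sum(w * color[channel] for w, color in zip(weights, colors))
        for channel in range(3)
    )


triangle = ((0.0, 0.0), (1.0, 0.0), (0.0, 1.0))
vertex_colors = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))  # red, green, blue

centroid = (1.0 / 3.0, 1.0 / 3.0)
print(interpolate_color(centroid, triangle, vertex_colors))
```

At a vertex, the weight of that vertex is one and the others are zero, so the vertex color is returned unchanged. At the centroid, all three colors contribute equally. The same weights can interpolate normals or texture coordinates.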


The next most important primitive after the vertex is the normal, a vector pointing outwards
from a surface. For a smooth surface, it is perpendicular to the surface tangent at a point.
In 3D graphics, normals are commonly encoded for each of the vertices or faces in the
scene [8]. They allow shading the geometry: by comparing the normal's direction to the
camera's view direction and the incoming light's direction, we can determine how much of the
incoming radiance reaches the point. The shading normal of a surface is sampled
during BRDF shading to determine how much light is reflected by the surface from the light
to the camera. If the normal points in almost the opposite direction of
all incoming light directions, the point should not be shaded at all. This is the
case with enclosed surfaces: if a camera is inside a cube and all the emission sources are
outside, the rendering is completely black. A commonplace trick in 3D tools is to
smooth the normals by slightly altering their directions [22]. This technique lowers the
number of faces required to produce smoothly lit surfaces.


There are a couple of standard ways to describe lights in virtual scenes. In rasterization,
it is common to describe lights as a single point, and with path tracing, lights often
have some area that is sampled [8]. The most straightforward lights are directional
lights, which do not have a location in the world and whose light direction l stays constant.
Then there are punctual lights, which are our main interest in this thesis. Punctual lights
have a single position and no surface area. The two most popular types of punctual lights
are point lights and spotlights. A point light's contribution to a surface point depends on
the location of the shaded point and the location of the light, as we saw with Eq. (2.1).
Point lights emit light uniformly in all directions. It is common practice to give point
lights a radius [22]. Point lights with a radius can be sampled in Monte Carlo path tracing:
a point on the light's area is sampled, and the probability of sampling that point is used to
weigh its contribution to the final radiance. This method is called multiple importance
sampling [23]. In addition to punctual and area lights, emissive triangles and textures
can emit light.


To convert a 3D world to a 2D screen, a common practice is to use several coordinate
systems: one when the 3D models are created, a second when the models are moved around the
scene, and a third when the models are finally rendered. In a typical workflow, a single
object is transformed from local space to world space, then to the camera's view space, then
to the camera's clip space, and finally to screen space [24] [25].


Conceptually, there are five different spaces between which we can transform.
In local coordinates, we have the vertices of an object relative to its local origin. When a
3D object is created, its vertex positions are relative to the origin of that system. World
space means that the observed scene now has multiple objects, which are relative to
the world origin and have some offset from it [24]. Translating an object to its desired
position in the world is done with a model matrix, which we also call the model transform in
this work [8]. The model transform is the one most actively updated during the run-time of
animations and interactive applications, like in games when characters move around the world.
If a virtual camera is placed in the world and we want to see what the camera sees, the
easiest and most common practice is to create a transform matrix that transforms each vertex
in the scene so that the camera becomes the center of the world. Such a matrix can be
formed from the camera's extrinsic parameters, like the position and orientation of the
camera. This matrix brings the world into the view space. From view space, the 3D perspective
can be achieved with a projection matrix created from the camera's intrinsic
parameters, which in computer graphics are often described with the aspect ratio, near plane,
and far plane [8].


The final space conversion is from the view space to the screen. This can be done with
a clip space matrix [8] [24]. Mathematically speaking, when we have a clip space matrix
$M_{clip}$, we can transform a world space coordinate $p_i$ by


$$
\begin{bmatrix} x_{clip} \\ y_{clip} \\ z_{clip} \\ w_{clip} \end{bmatrix} = p_i M_{clip}, \tag{2.2}
$$


where $x_{clip}, y_{clip}, z_{clip}, w_{clip}$ are the new coordinates in the camera's
homogeneous clip space [24]. These coordinates can be brought to normalised device
coordinates with the perspective division:


$$
\begin{bmatrix} x_{ndc} \\ y_{ndc} \\ z_{ndc} \end{bmatrix}
=
\begin{bmatrix} x_{clip} / w_{clip} \\ y_{clip} / w_{clip} \\ z_{clip} / w_{clip} \end{bmatrix},
$$


where $x_{ndc}, y_{ndc}, z_{ndc}$ are the coordinates now in normalised device space. Finally,
we obtain the screen space coordinates $x, y$ by converting the coordinates from the range
$[-1, 1]$ to $[0, 1]$ and then multiplying them by the width $w$ and height $h$ of the
rendered screen resolution:


$$
\begin{bmatrix} x \\ y \end{bmatrix}
=
\begin{bmatrix} w \, (x_{ndc} + 1) / 2 \\ h \, (y_{ndc} + 1) / 2 \end{bmatrix}. \tag{2.3}
$$


The retrieved screen space coordinates $x, y$ now represent where the vertices end up on the
screen [24].
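
The last two steps, the perspective division and the mapping to screen space, can be sketched in Python as follows. The function name and the example resolution are hypothetical, and the sketch assumes the clip space coordinates have already been computed with a clip space matrix.

```python
def clip_to_screen(clip, width, height):
    """Convert homogeneous clip space coordinates to screen space.

    clip is (x_clip, y_clip, z_clip, w_clip); width and height give the
    resolution of the rendered screen in pixels.
    """
    x_clip, y_clip, z_clip, w_clip = clip

    # Perspective division brings the point to normalised device coordinates.
    x_ndc = x_clip / w_clip
    y_ndc = y_clip / w_clip
    z_ndc = z_clip / w_clip

    # Remap from [-1, 1] to [0, 1] and scale by the resolution.
    x = (x_ndc + 1.0) / 2.0 * width
    y = (y_ndc + 1.0) / 2.0 * height
    return x, y, z_ndc


center = clip_to_screen((0.0, 0.0, 0.0, 1.0), 1920, 1080)
corner = clip_to_screen((2.0, -2.0, 1.0, 2.0), 1920, 1080)
print(center, corner)
```

A point on the optical axis lands at the center of the screen, and a point with $x_{ndc} = 1$ and $y_{ndc} = -1$ lands in a corner. Whether $y$ grows upwards or downwards on the screen depends on the graphics API, which this sketch leaves open.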


2.2 Animations 


Dynamic and interactive virtual world simulation requires transformations. The position of
a vertex can be changed with an affine transformation expressed as a homogeneous
transformation matrix. The standard convention in computer graphics is to multiply the
translation, rotation, and scale matrices into a single matrix TRS, which can be compactly
sent to the GPU for the final position calculation in the vertex shader [8] [25]. In a
three-dimensional, OpenGL-like right-handed coordinate system, the position vector p of a
single vertex can be transformed with the TRS matrix by:




Figure 2.3. Graphics applications using Euler angle representation apply rotations 
around three axes one after another. In the middle figure, a yellow curve displays the 
path from A to B Euler angle representation may take if the rotation is done by rotating 
the red and green circles 90 degrees. The image on the right shows the shortest path 
from A to B achieved by rotating with a unit quaternion. 


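$$
p' =
\begin{bmatrix}
S_x R_{00} & R_{01} & R_{02} & T_x \\
R_{10} & S_y R_{11} & R_{12} & T_y \\
R_{20} & R_{21} & S_z R_{22} & T_z \\
0 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix} p_x \\ p_y \\ p_z \\ 1 \end{bmatrix},
$$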


where S is the scaling multiplier for each axis, R is the three-by-three Euler angle ro- 
tation matrix, T is the added translation, and p’ is the newly transformed position [24]. 
Transforming all the vertices of an object with a TRS matrix is often called rigidbody 
transformation. 
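
As a small illustration, the Python sketch below composes a TRS matrix and applies it to a vertex position. It assumes the conventional matrix product T · R · S with column vectors, so the point is scaled first, then rotated, and finally translated; for brevity, only a rotation around the z-axis is implemented. The helper names are hypothetical.

```python
import math


def mat_mul(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def translation(tx, ty, tz):
    return [[1.0, 0.0, 0.0, tx],
            [0.0, 1.0, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]


def rotation_z(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0, 0.0],
            [s, c, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]


def scale(sx, sy, sz):
    return [[sx, 0.0, 0.0, 0.0],
            [0.0, sy, 0.0, 0.0],
            [0.0, 0.0, sz, 0.0],
            [0.0, 0.0, 0.0, 1.0]]


def transform_point(matrix, point):
    """Apply a 4x4 transform to a 3D point using homogeneous coordinates."""
    x, y, z = point
    v = (x, y, z, 1.0)
    return tuple(sum(matrix[i][k] * v[k] for k in range(4)) for i in range(3))


# Scale by two, rotate 90 degrees around z, then move by (1, 2, 3).
trs = mat_mul(translation(1.0, 2.0, 3.0),
              mat_mul(rotation_z(math.pi / 2.0), scale(2.0, 2.0, 2.0)))
p_new = transform_point(trs, (1.0, 0.0, 0.0))
print(p_new)
```

The point (1, 0, 0) is first scaled to (2, 0, 0), rotated to (0, 2, 0), and translated to approximately (1, 4, 3). Composing the three matrices once and reusing the result for every vertex of an object is what makes the single TRS matrix convenient to send to a vertex shader.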


There are two sizable issues with three-by-three rotation matrices built from Euler angles:
gimbal lock and interpolation from one rotation to another. The Euler angle representation
describes the rotation around three axes. In a right-handed coordinate system, these
rotations are pitch, which rotates around the x-axis, yaw, which rotates around the
y-axis, and roll, which rotates around the z-axis. In the Euler angle representation, these
three rotation matrices are multiplied one after another. Because matrix multiplication
is not commutative, the order of the multiplication matters, and this ordering is what makes
the lock possible. The gimbal lock occurs when the last axis in the rotation multiplication
chain aligns with the second rotation [26]. One degree of freedom is then lost, as both the
second and the third matrices try to rotate around the same axis. The gimbal lock is often
avoided, for example in game cameras, by placing the pitch as the last of the three rotations
and limiting its angle to the range (-90, 90) degrees, where it can never align with the
second rotation [27]. The range (-90, 90) is acceptable, as it only limits the view between
looking straight down at the ground and straight up at the sky.
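
That the order of the rotation matrices matters can be checked with a short Python sketch. The example below is an illustration only; it rotates a forward-pointing vector by a quarter turn of pitch and a quarter turn of yaw in both orders, using hypothetical helper functions.

```python
import math


def rotation_x(angle):
    """Pitch: rotation around the x-axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


def rotation_y(angle):
    """Yaw: rotation around the y-axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def mat_mul3(a, b):
    """Multiply two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def apply(matrix, vector):
    """Multiply a 3x3 matrix with a column vector."""
    return tuple(sum(matrix[i][k] * vector[k] for k in range(3)) for i in range(3))


quarter_turn = math.pi / 2.0
forward = (0.0, 0.0, 1.0)

# With column vectors, the rightmost matrix is applied first.
pitch_then_yaw = apply(mat_mul3(rotation_y(quarter_turn), rotation_x(quarter_turn)), forward)
yaw_then_pitch = apply(mat_mul3(rotation_x(quarter_turn), rotation_y(quarter_turn)), forward)
print(pitch_then_yaw, yaw_then_pitch)
```

Applying pitch first sends the vector to approximately (0, -1, 0), whereas applying yaw first sends it to approximately (1, 0, 0), so the two products describe different rotations.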


The other complication is the interpolation between two rotations. With the Euler angle
representation, it is non-trivial to rotate along the desired geodesic curve. Even though the
start and end points of the animated rotation key frames shown in the example Figure
2.3 are correct, the rotation from point A to point B with pitch, yaw, and roll does not
follow the intended shortest path. The correct curve, seen in the rightmost image in
Figure 2.3, can be achieved by representing the rotations as unit quaternions and applying
spherical linear interpolation, slerp for short, between two of them [29]. Unit quaternions
are a subset of quaternions that represent rotations and can be defined as follows:


\[ q = i q_x + j q_y + k q_z + q_w, \]

where $q$ is the unit quaternion, $q_w$ is the real part, and $q_x$, $q_y$, $q_z$ form the
imaginary part, with $i$, $j$, $k$ being the imaginary units [8]. The imaginary vector part
$(q_x, q_y, q_z)$ of the unit quaternion supports all the usual vector operations, such as
scaling, addition, cross product, and dot product. Unit quaternions may also be written

\[ q = \sin\phi \, \mathbf{u} + \cos\phi, \]

where $\mathbf{u}$ is a three-dimensional vector such that $\|\mathbf{u}\| = 1$ and $\phi$
determines the amount of rotation around that vector. When a unit quaternion $q$ is multiplied
with its multiplicative inverse $q^{-1}$, the equation

\[ q^{-1} q = q q^{-1} = 1 \]


holds true [8]. The desired shortest arc between the points A and B of Figure 2.3
can be achieved by slerping from unit quaternion $q_0$ to unit quaternion $q_1$ with the
software-implementation-friendly form

\[ \mathrm{slerp}(q_0, q_1, t) = \frac{\sin(\phi(1 - t))}{\sin\phi} q_0 + \frac{\sin(\phi t)}{\sin\phi} q_1, \]

where $t \in [0, 1]$ is the interpolation parameter between the two quaternions and $\phi$ is
the angle between them, with $\cos\phi = q_0 \cdot q_1$ [8]. The intrinsic geodesic
distance between two unit quaternions, that is, the angular change in rotation $d_r$, can be
calculated with

\[ d_r(q_i, q_{i-1}) = \| \ln(q_{i-1}^{-1} q_i) \|, \tag{2.4} \]

where $q_i$ and $q_{i-1}$ are the compared rotation unit quaternions [30].
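
As an illustration, slerp and the distance of Eq. (2.4) can be sketched in a few lines of Python. This is a minimal sketch rather than the implementation used in this work: the (x, y, z, w) component order, the function names, the sign flip that selects the shorter arc, and the small-angle fallback are all assumptions. The distance function relies on the fact that, for unit quaternions, the real part of $q_{i-1}^{-1} q_i$ equals their four-dimensional dot product, so the norm of the logarithm reduces to an arccosine.

```python
import math

def quat_dot(q0, q1):
    """Four-dimensional dot product of two quaternions stored as (x, y, z, w)."""
    return sum(a * b for a, b in zip(q0, q1))

def slerp(q0, q1, t):
    """Spherical linear interpolation from unit quaternion q0 to q1, t in [0, 1]."""
    cos_phi = quat_dot(q0, q1)
    # Assumption: q and -q encode the same rotation, so flip q1 to take the shorter arc.
    if cos_phi < 0.0:
        q1 = tuple(-c for c in q1)
        cos_phi = -cos_phi
    phi = math.acos(min(1.0, cos_phi))
    if phi < 1e-6:
        # Nearly identical rotations: sin(phi) is close to zero, so return q0 directly.
        return tuple(q0)
    s = math.sin(phi)
    w0 = math.sin((1.0 - t) * phi) / s
    w1 = math.sin(t * phi) / s
    return tuple(w0 * a + w1 * b for a, b in zip(q0, q1))

def geodesic_distance(q_cur, q_prev):
    """Norm of ln(q_prev^-1 * q_cur) for unit quaternions, in radians."""
    cos_theta = max(-1.0, min(1.0, quat_dot(q_prev, q_cur)))
    return math.acos(cos_theta)

identity = (0.0, 0.0, 0.0, 1.0)
# 90 degree rotation around the z-axis: (u * sin(45 deg), cos(45 deg)).
quarter_turn = (0.0, 0.0, math.sin(math.pi / 4), math.cos(math.pi / 4))
halfway = slerp(identity, quarter_turn, 0.5)
```

Here `halfway` corresponds to a 45 degree rotation around the z-axis. Note that the value returned by `geodesic_distance` is the quaternion half-angle, so the rotation angle itself is twice that value.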


Figure 2.4. The top image displays the skin, and the bottom image the skeletal bones and the
influence the forearm bone has on the vertices. In the bottom image, the red color indicates
the joint's total influence, yellow and green that there is some influence, and blue that the
forearm joint does not influence the vertex skinning. The character model is Michelle from
the Mixamo animation library [31].


Unit quaternions are handy: it can be shown that a point represented in a vector format
$p = (p_x \; p_y \; p_z \; p_w)^T$ is rotated by a unit quaternion $q$ with

\[ q p q^{-1}, \]

around axis $\mathbf{u}$ by angle $2\phi$ [8]. When only rotation is required, this
four-component representation of a rotation is preferred over the nine-component rotation
matrix, and it is used in memory-constrained systems, such as graphics rendering on the GPU [25].


Rotating and translating rigidbodies allows animations to have only rigid motions. In 1988,
Magnenat-Thalmann et al. proposed a new technique called vertex skinning [32], and it has
since become standard in all rigging software, 3D games, and animations [33] [34]. In this
method, a skeleton is applied in addition to the mesh to help modify and animate the vertices.
This procedure simplifies animating characters with human-like motion [35].


Skeletal representation requires two things: a mesh, which is often also called a skin, and 
a hierarchy of bones. A popular name for the combination of the skin and the bone hier- 
archy is armature [22]. When the bones are animated, they control the movement of the 
corresponding vertices in the skin. In addition to the armature, predetermined animation 
poses are declared, especially in an interactive context like games. For example, when
a game character runs and then jumps, the animations are interpolated between key
frames of the running animation and also blended between running and the next jumping
animation [33].


A bone hierarchy is represented with a transformation from a bind pose [35]. A bind pose,
sometimes also called a rest pose, is the starting point for the animations, and all transforms
are described as offsets from this pose. The skin's vertices must carry the information on
which joints influence them and with what weight. In Figure 2.4, the top image shows the
skin, and the bottom image presents a forearm bone and its weights. Vertices in the
red-colored areas are transformed fully by the forearm bone, vertices in the blue-colored
areas are not influenced at all, and vertices in the yellow to green areas get some influence
from this joint [35]. Bones are in a relational hierarchy to each other. For example,
transforming the knee of a character will also recursively transform the rest of the leg below
it. The skinning procedure simplifies animation: only a single rotation matrix is
changed, and the whole leg and all of its vertices are moved.


Vertex skinning is more formally called linear blend skinning. The vertices are linearly
blended near the joints. The world space position of each vertex can be computed as the
weighted average:

\[ p' = \sum_{i=0}^{N} w_i B_i M_i^{-1} p, \]

where $p'$ is the new blended vertex world space position, $N$ is the number of bones
affecting the given vertex, $w_i$ is the weight with which the indexed bone influences this
vertex, $B_i$ is the local animation transformation for the indexed bone, $M_i$ is a bind pose
matrix that transforms the bone's coordinate system to world coordinates, and $p$ is the
original vertex position [8]. Often with human characters, $N$ is four, as it allows most poses
and does not require too much data to compute on the GPU [27].
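
The weighted average above can be illustrated with a small Python sketch. It is not the shader from the appendix: the helper names, the row-major list-of-lists matrices, and the two-bone test setup are assumptions made for the example, and the inverse bind matrices are passed in precomputed.

```python
import math

def mat_mul(a, b):
    """Product of two 4x4 matrices stored as row-major nested lists."""
    return [[sum(a[r][k] * b[k][c] for k in range(4)) for c in range(4)] for r in range(4)]

def mat_vec(m, v):
    """Product of a 4x4 matrix and a homogeneous 4-vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def translation(x, y, z):
    return [[1, 0, 0, x], [0, 1, 0, y], [0, 0, 1, z], [0, 0, 0, 1]]

def rotation_z(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0, 0], [s, c, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

def skin_vertex(p, influences):
    """Blend p with a list of (weight, B, M_inv) tuples: sum of w_i * B_i * M_i^-1 * p."""
    blended = [0.0, 0.0, 0.0, 0.0]
    for weight, bone, bind_inverse in influences:
        moved = mat_vec(mat_mul(bone, bind_inverse), p)
        blended = [acc + weight * comp for acc, comp in zip(blended, moved)]
    return blended

identity = translation(0, 0, 0)
# Hypothetical setup: bone 0 rests at the origin, bone 1 has its joint at (1, 0, 0).
bind_1 = translation(1, 0, 0)
bind_1_inverse = translation(-1, 0, 0)
# Bone 1 is animated by rotating 90 degrees around its own joint.
animated_1 = mat_mul(bind_1, rotation_z(math.pi / 2))

vertex = [2.0, 0.0, 0.0, 1.0]
skinned = skin_vertex(vertex, [(0.5, identity, identity), (0.5, animated_1, bind_1_inverse)])
```

With equal weights, the vertex lands halfway between its rest position (2, 0, 0) and the fully rotated position (1, 1, 0). This straight-line averaging is what makes the mesh lose volume at extreme joint angles.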


For shading, we also need to update the normals. We compute the normals naively with a
similar weighted average:

\[ n' = \sum_{i=0}^{N} w_i (B_i M_i^{-1})^{-T} n, \]

where $n'$ is the new blended shading normal, $n$ is the original normal, $N$ is the number of
bones influencing the normal, and $(B_i M_i^{-1})^{-T}$ is the inverse transpose of each
blending matrix, in which $M_i$ is the bind pose matrix and $B_i$ is the animation matrix [35].
However, computing the inverse transpose is inefficient, and the computed normal is
considerably inaccurate. A proposed method tackles these issues and achieves better-quality,
more precise normals [36]. An example implementation of simple skinning in a vertex shader,
applicable in the modern graphics API Vulkan [10], is presented in Appendix A.1.


Figure 2.5. Linear blending (left) and dual quaternion blending (right). Dual quaternions
have a lot fewer skinning artifacts compared to linear blending. In linear blending, with
extreme angles, the mesh collapses, whereas in the dual quaternion blending, the volume
is preserved, resulting in a bulging effect.


Linear blending has an issue with extreme blending angles, where the geometry collapses.
The issue can be observed on the left cylinder in Figure 2.5. It can be fixed with
dual quaternions [37], which extend the basic unit quaternion with the concept of dual
numbers. Dual numbers are expressions of the form

\[ d = a + b\varepsilon, \]

where $d$ is the dual number, $a$ and $b$ are real numbers, and $\varepsilon$ is the dual unit,
which satisfies $\varepsilon^2 = 0$ [37]. Similar to how quaternions have real and imaginary
parts, a dual number has a real part and a dual part. A translation can also be converted to
a dual number:


\[ d_{translation} = (1, 0, 0, 0) + \frac{(0, x, y, z)}{2} \varepsilon, \]

where $d_{translation}$ is the translation in dual number form, and $x$, $y$, $z$ are the
coordinate changes in the vertex position. This dual number representation of a translation
can be combined with a rotation into a single dual quaternion:

\[ d_{transform} = d_{translation} d_{rotation} = q_{rotation} + \frac{(0, x, y, z)}{2} q_{rotation} \varepsilon, \]
where $d_{transform}$ is the dual quaternion that can perform both the rotation and the
translation, $d_{translation}$ is the translation in dual number form, and $d_{rotation}$ is
the rotation in dual number form, whose real part is the rotation quaternion $q_{rotation}$ [37].
Dual quaternions can be decomposed back to the rotation in quaternion format and the
translation in vector format [37]. When blending is performed with dual quaternions, the
volume of the skin is preserved, as seen on the right cylinder in Figure 2.5. Dual quaternions
are also quicker to blend, with generally over 20% improved performance, and they take only
eight components, compared to the 16 that TRS matrices take, making them well suited for
skeletal animations [37].


In addition to rigidbody and skeletal animations, the final animation method is to animate
each vertex explicitly. This method is called vertex morphing, and sometimes the name
shape key animation is used [8]. Morphing is most commonly used in facial animation,
where each key frame of the animation has new positions for all vertices, and the positions
are interpolated between the key frames [38].

3 TEMPORAL RENDERING


Rendering workload can be lowered by taking advantage of the fact that subsequent
frames are very similar. In this chapter, we look at temporal reuse algorithms that utilize
frame-to-frame image coherence to speed up rendering. Then we define the requirements
for a good benchmarking dataset for this domain. In the following section,
we present metrics for comparing how temporally challenging different datasets are for
rendering. Finally, we review the previous methods used for capturing animations.


3.1 Reuse Algorithms 


A body of research exists on reusing temporal data for better performance, both in path
tracing and in real-time constrained rasterization and ray tracing [39]. Given
the time complexity of rendering an image, shortcuts are often taken, either by lowering the
number and the length of the sampled paths in the Monte Carlo integration or by lowering
the sample count of spatial anti-aliasing methods like Multisample Anti-aliasing (MSAA).
The continuous image function must be sampled at a high frequency so that the final pixel
receives an anti-aliased color. A recent survey identified the two components of Temporal
Anti-aliasing (TAA): sample accumulation and history validation [40].


Figure 3.1 shows a generalized execution flow of TAA algorithms. Each frame,
the renderer first streams the necessary image features through a few steps. Motion vectors
describe the pixel's velocity during the animation and can be used to reproject where the
pixel was in the previous frame. A color sample, the returned color value of a single camera
ray, is validated against the reprojected history by utilizing other features. With the
knowledge of whether the history is valid or not, it can be rectified. Finally, the new sample
is accumulated on top of the history. The result can then be passed on to the next frame and
to post-processing and display.


Figure 3.1. The diagram shows a typical flow of execution and the procedures used in
temporal reuse algorithms, like TAA. First, the feature buffers are rendered. Then, using the
history buffer of the previous frame, the history of the current pixel can be validated, and
with a new sample, the history might be rectified. After that, the new sample is blended
with the history colors. The final color can be sent to the screen and for the next frame to
use.


In TAA, a different sample position is selected for each frame so that the samples integrate
temporally into a supersampled result, which amortizes the cost of supersampling [41]. Sample
positions are jittered with a sampling strategy, the state of the art being low-discrepancy
patterns like the Halton or Sobol sequence, which distribute samples evenly and mitigate
problems if the temporal integration has to be restarted [43]. The amortized history buffer of
a pixel is shown in Figure 3.2, where new sample positions are accumulated in each frame. The
same process works for other methods requiring sample integration. These effects include
diffuse global illumination, ambient occlusion, shadows, and reflections [40].
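
To make the jittering concrete, the following Python sketch generates sub-pixel offsets from a Halton sequence. It is only an illustration: the function names, the choice of bases 2 and 3 for the two axes, and the recentring of the offsets around the pixel center are assumptions, not details taken from the cited works.

```python
def halton(index, base):
    """Radical inverse of a 1-based index in the given base, a value in [0, 1)."""
    result = 0.0
    fraction = 1.0
    while index > 0:
        fraction /= base
        result += fraction * (index % base)
        index //= base
    return result

def jitter_offsets(count):
    """Sub-pixel offsets in [-0.5, 0.5), one (dx, dy) pair per frame."""
    # Assumption: bases 2 and 3 are used for the horizontal and vertical axes.
    return [(halton(i, 2) - 0.5, halton(i, 3) - 0.5) for i in range(1, count + 1)]

offsets = jitter_offsets(8)
```

A renderer would typically cycle through such a list, adding the current offset to the sample position each frame, so that consecutive frames sample different locations inside the pixel.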


After a pixel has been reprojected and history retrieved, new samples are then constantly 
accumulated and blended together: 


\[ f_n(p) = \alpha \cdot s_n(p) + (1 - \alpha) \cdot f_{n-1}(\pi(p)), \]

where $f_n(p)$ is the color of the current frame $n$ at a given pixel $p$, $\alpha$ is the
blending factor, $s_n(p)$ is the new sampled color of frame $n$ at pixel $p$, and
$f_{n-1}(\pi(p))$ is the accumulated pixel history from the previous frame, reprojected with
the reprojection operator $\pi(\cdot)$. The blending factor controls the rate at which the
oldest samples are forgotten from the color history.
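
The accumulation step is an exponential moving average, which the following Python sketch shows for a single pixel. The function name and the RGB-tuple representation are assumptions, and the history is taken as already reprojected.

```python
def accumulate(sample, history, alpha):
    """Blend a new color sample with the reprojected history color, per channel."""
    return tuple(alpha * s + (1.0 - alpha) * h for s, h in zip(sample, history))

# A pixel whose history starts black and which receives a constant white sample.
color = (0.0, 0.0, 0.0)
for _ in range(10):
    color = accumulate((1.0, 1.0, 1.0), color, alpha=0.1)
```

After n frames, the history in this setup has reached 1 - (1 - alpha)^n of the new value, so a small blending factor gives a smoother result but reacts slowly to changes.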


Figure 3.2. Sampling the history of a target pixel. The history buffer (left) has been
formed from previous sampling positions. Often, these sampling positions are not saved
separately but accumulated as the final color of the previous frame. On the current frame
(right), a new input sample is taken and blended with a bilinearly sampled history from the
pixel's position in the previous frame. The disoccluded red area appearing behind
the blue triangle would require more samples to affect the final pixel value.


Reprojecting a pixel is simple if neither the scene nor the camera has changed from the
previous frame, but with a dynamically changing scene, it is not as easy. Common
practice is to use the reverse reprojection displayed in Figure 3.2 [41]. A similar but opposite
approach is forward reprojection, where the previous frame's pixels are projected onto the
current frame, but it is less efficient on graphics hardware [44]. The known motion of
the pixels, the motion vectors, is used to look up where the current pixel was
located in the previous frame, and the history is accumulated accordingly. In a real-time
context, it is common practice to use a hardware-accelerated bilinear texture fetch when
retrieving the previous color from the history color texture. The fetch mitigates issues of
naive reprojection, where the reprojected position might fall between pixels and where the
camera's sub-pixel offset caused by jittering results in a non-1:1 mapping between the
previous and the current frame. Jittering means that the sample position is offset from the
center of the pixel. In the bilinear texture fetch, the wanted color is interpolated from the
four nearby samples and weighted accordingly, as seen in Figure 3.2. In motion, this has the
downside of introducing resampling blur, as it softens the resampled image. Resampling happens
in each frame, so the high-frequency detail can be quickly lost.
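
A software version of the reverse reprojection with a bilinear fetch might look like the following Python sketch. Several details are assumptions made for the example: the history is a single-channel image stored as nested lists, integer coordinates address pixel centers, lookups outside the image are clamped to the edge, and the motion vector is defined as the current position minus the previous position in pixels.

```python
import math

def bilinear_fetch(image, x, y):
    """Interpolate a value from the four pixels surrounding (x, y)."""
    height, width = len(image), len(image[0])
    # Assumption: clamp to the image edge instead of rejecting the lookup.
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = x - x0, y - y0
    top = image[y0][x0] * (1.0 - fx) + image[y0][x1] * fx
    bottom = image[y1][x0] * (1.0 - fx) + image[y1][x1] * fx
    return top * (1.0 - fy) + bottom * fy

def reproject_history(history, px, py, motion):
    """Fetch the history value for the current pixel (px, py) using its motion vector."""
    return bilinear_fetch(history, px - motion[0], py - motion[1])

history = [[0.0, 1.0],
           [2.0, 3.0]]
value = reproject_history(history, 1, 1, (0.5, 0.0))
```

In this example the pixel moved half a pixel to the right, so its history is read between the two bottom pixels. Repeating such interpolated reads every frame is the source of the resampling blur.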


High-quality motion vectors are not trivial to come by [45]. Game engines often create
a motion vector texture that holds the offset between the current and the previous location
of a pixel by transforming the geometry twice, with the current and the previous camera
matrices [46]. Problems arise when shading is not geometry related, which is the case with
transparent and highly reflective materials, lights, and shadows [45].


Another issue with sample accumulation is the reintroduction of aliasing artifacts for
already smooth edges on moving objects. A few methods have been proposed to solve this,
for example, using a 4-tap dilation window or other adaptive filtering schemes [47].
With too small a blending factor $\alpha$, the color may start to show resampling error or
temporal lag. To avoid such issues, the history is kept refreshed by clamping $\alpha$ to a
lower bound. Motion can also be used to adapt $\alpha$ so that the bilinear resampling error
is prevented. In rasterization, there is also an issue with sampling and integrating over the
pixel area in both geometry and texture: because textures are usually filtered with
mipmap methods, the result might be overly blurred. Typically, a mipmap bias is applied,
which also helps to solve issues introduced by the temporal upsampling methods that we
will describe shortly [46].


As the camera and the world move, the lighting conditions change, or disocclusions may
appear, as in Figure 3.2. The history data should be rejected if an error is detected. The
confidence of the history can be estimated mainly from two sources: either from geometry
data, like depth, normal, motion vector, or object ID, or from color or radiance data.
Disocclusion can be recognized with geometry data by comparing, for example, the current
pixel's depth value with the depth value of the pixel's history; if they differ by more than
a small, scene-dependent error tolerance, the pixel can be marked invalid [44]. Even more
robust matching can be achieved with additional geometry data, like the surface normal or
object identifier. The whole history may not need to be discarded, though, as we will see
when discussing history rectification strategies later.
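
A geometry-based validity test of this kind can be sketched as follows. The function signature, the absolute depth difference, and the optional object ID comparison are assumptions chosen for illustration, and the tolerance has to be tuned per scene.

```python
def history_is_valid(depth_cur, depth_prev, tolerance, id_cur=None, id_prev=None):
    """Return True when the reprojected history can be reused for this pixel."""
    # A large depth difference suggests that the surface was disoccluded or occluded.
    if abs(depth_cur - depth_prev) > tolerance:
        return False
    # Optional, more robust check: the same object must be visible in both frames.
    if id_cur is not None and id_prev is not None and id_cur != id_prev:
        return False
    return True

same_surface = history_is_valid(10.0, 10.05, tolerance=0.1)
disoccluded = history_is_valid(10.0, 12.0, tolerance=0.1)
other_object = history_is_valid(10.0, 10.0, tolerance=0.1, id_cur=1, id_prev=2)
```

Only the first of the three cases keeps its history. The other two would be marked invalid and handled by rejection or rectification.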


With shading changes, like shadows, lighting, and reflections, geometry data is not
sufficient. The validity of the shading can be checked efficiently with color and radiance
data [48]. Directly comparing colors indicates whether the data is invalid because of a
visible lighting or shading change or because of indirect motion vectors. However, comparing
the current frame's aliased samples to the anti-aliased history might bias the error estimate,
so a few methods have been proposed to help in the comparison [41].


Rejecting invalid history resets the temporal integration process, which may lead to
artifacts [41]. Rejected data can be made more consistent by taking more samples. This is
called history rectification [51]. The idea is to pull color information from the
neighborhood into our sparse and aliased samples. A color bounding box is created from the
neighborhood colors, and if the current sample falls inside it, the history can be connected
and rectified. Otherwise, it should be discarded. When the rectified history is blended
with the new sample, these artifacts are removed. More sophisticated color neighborhoods and
gradients have been proposed for better-quality rectification [14].
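
As a rough illustration of the color bounding box, the Python sketch below builds an axis-aligned RGB box from the three-by-three neighborhood of the current frame and clamps the history color into it. Clamping is one widely used variant and is an assumption here; it is not necessarily the exact accept-or-discard rule of the cited works, and the function names are hypothetical.

```python
def neighborhood_box(image, x, y):
    """Per-channel minimum and maximum of the 3x3 neighborhood around pixel (x, y)."""
    height, width = len(image), len(image[0])
    colors = [image[j][i]
              for j in range(max(0, y - 1), min(height, y + 2))
              for i in range(max(0, x - 1), min(width, x + 2))]
    lower = tuple(min(c[k] for c in colors) for k in range(3))
    upper = tuple(max(c[k] for c in colors) for k in range(3))
    return lower, upper

def rectify(history_color, lower, upper):
    """Clamp the history color into the box so that it agrees with the neighborhood."""
    return tuple(min(max(c, lo), hi) for c, lo, hi in zip(history_color, lower, upper))

# Current frame: uniformly dark gray with a slightly brighter center pixel.
frame = [[(0.2, 0.2, 0.2) for _ in range(3)] for _ in range(3)]
frame[1][1] = (0.4, 0.4, 0.4)
lower, upper = neighborhood_box(frame, 1, 1)
rectified = rectify((1.0, 0.3, 0.0), lower, upper)
```

The stale history color (1.0, 0.3, 0.0) is pulled to (0.4, 0.3, 0.2), which removes the out-of-range red and blue values while keeping the plausible green channel.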


Rectifying the history of a pixel also comes with downsides. Temporal artifacts, like
ghosting, might appear when the history is not correctly invalidated and forgotten [14].
Rectification techniques assume that the neighborhood of the pixel's surface point contains
similar values. Since there are only a few samples for the pixel, highly detailed content with
a thin geometry or shading feature is easily missed, which results in dropping the pixel's
history. Without the temporally supersampled history, the quality suffers from aliasing or
temporal instability issues, like flickering.


Temporal upsampling is another technique to achieve a higher frame rate or
resolution [53]. The benefit of the technique is that the sampling rate is reduced from one
sample per pixel to a fraction of a sample per pixel. Upsampling produces higher
resolution images by just accumulating lower resolution shading results. In practice, the
input samples are upscaled to the desired output resolution and then normalized to it.
Normalization can be done with the sum of the weights of a neighborhood averaging filter,
like a Gaussian kernel.


As with all of the reviewed temporal algorithms, there are a few issues with the upsampling
methods. When upscaling is applied to the sample accumulation function, the
high-frequency detail is lost [14]. With the temporal jittering strategy, each upscaled pixel
receives now and then high-quality samples that have not been interpolated from the
nearby upscaling neighborhood, resulting in a pixel of sufficient quality over time. Also, a
confidence quality factor for each output pixel has been proposed [46]. With it, the history
is retained when the quality of the accumulated input sample has low confidence. The
upscaled sample accumulation also suffers from the same problems that we mentioned
before, and now there are even fewer input samples to help solve the issues. History
cannot be easily rectified, which leads to a careful trade-off between ghosting, temporal
instability issues, like flickering, and blurriness.


The three-by-three neighborhood used in rectification for the color bounding box
computation is now larger: it covers the three-by-three area multiplied by the upscaling
factor, for example, nine-by-nine. A few different approaches have been used to help in the
rectification process. One proposed method computes a smaller neighborhood, based on the used
sub-pixel offset, for the color bounding box calculation. The temporal ghosting artifact is
reduced, as the sampled points further away from the pixel are not
blended. Similarly, a fixed two-by-two neighborhood has been proposed, with complementary
results [52]. One other method, used in the game Quantum Break, trades the quality
of dynamic lights, shadows, and animated textures for sharper and more stable static
shots [54]. Their method uses the speed of the motion vectors to relax the color bounding box
and uses a tighter bounding box with smaller vectors to increase the sample accumulation.


An interesting recent finding with temporal upsampling is the noticeable difference that
appears when the order of the sampling interacts with dynamic geometry [52]. The authors
recognize two modes for the sampling, bow tie and hourglass, which work better when the
sampling motion is horizontal or vertical. In addition to these, they show how increasing the
rendering frequency lowers the required temporal sample count, as it uses the human visual
system to integrate the samples into a perceptually sharper image with fewer visible
artifacts.


A technique similar to upsampling is checkerboard rendering (CBR). In CBR, instead of
applying a jitter offset to the sampling locations, a diagonal checkerboard pattern is used.
The method temporally integrates higher resolution images. CBR methods often utilize
hardware-dependent and hardware-accelerated sampling strategies. CBR was widely used in
previous generation game consoles as a unified upsampling strategy to reach 4K
resolution [55]. Given that only a portion of the pixels are shaded, the history must be used
to fill in the gaps. The CBR method suffers from the same problems that the upsampling
method does.


In virtual reality (VR) rendering, motion sickness is still a common problem, but
temporal methods have been utilized to alleviate it [58]. In motion sickness, the user
faces physical discomfort caused by perceiving visual information that conflicts with what
they physically experience. Practical solutions to this include lowering the latency,
increasing the frame rate, and using high-quality motion tracking [59] [60].


Temporal data can approximate the result when the rendering does not finish in
time [58]. Reprojecting the previous image to a new position when a user turns their
head is called asynchronous reprojection, or time warp. In practice, each frame is
reprojected asynchronously along with the actual rendering, and when the rendering exceeds
its time budget, the prepared reprojection can be used instead [61]. VR rendering also has
temporal complications that are present when the scene changes a lot. Color fringing is
an issue where colors smear into each other in some VR headsets, as each of the colored
lights, red, green, and blue, is temporally displayed one after another [4]. Then there
is judder, which appears with quick head movements. In judder, high persistence and a low
refresh rate introduce a motion blur-like effect, where pixel colors are both smearing and
strobing [62]. Quick head movements are a fundamental problem in VR: even with a
refresh frequency of 60 frames per second, a human can turn their head several hundred
degrees per second and at the same time gaze in the opposite direction, doubling the
effect [4]. Furthermore, there is a complicated problem that comes from natural human
motion: when the head is turned but the gaze is fixed on a point, humans are used to
seeing the point they are focusing on crystal clear. However, this may
not be the case, as the temporal methods may introduce some motion blur.


Deep learning in rasterization has also recently become of interest in temporal reuse
cases. Deep Learning Supersampling (DLSS) 2.0 by NVIDIA builds on the TAA ideas
of upsampling and temporal reuse by applying deep learning to them [63]. The network is
trained with high-quality ground truth images to learn the fundamentals of reconstruction.
The algorithm has not been released to the public yet, but in the presentation slides, Liu
et al. claim that DLSS 2.0 uses as input features the sampling jitter offsets, geometric
motion vectors, depth buffers, low-resolution pixel samples, and exposure. Many of these
are familiar feature buffers from TAA. There are no research comparisons yet of the quality
of this rendering method against the previous state of the art, but the public reception
of the perceived improvement in image quality and rendering performance has been
overall positive [64].


There are not too many temporal reuse methods in the world of path tracing, as there has
traditionally been practically unlimited time to produce the image. The recent development
of trying to render the image with path tracing in real time has introduced some solutions
and problems. These include using small sample counts and spatio-temporal features to
denoise the image [66]. Even though these methods converge to clear images with global
illumination in static scenes, when temporal motion is introduced with the camera or with the
scenery, or when the lighting conditions change, they start to blur, and the global
illumination begins to lag behind [66].


3.2 Benchmarking Temporal Rendering 


Benchmarking helps to better understand the underlying problem and identify parts that
still require improvements. In this section, we first discuss some of the requirements for
a satisfactory dynamic benchmark. Then, we describe comparison metrics that can be
used to compare temporal features of dynamic datasets, and finally, we review methods
that have been previously used in the capturing of 3D animations.


3.2.1 Benchmark Requirements


We are planning to publish the benchmark so that it can be openly used in future graphics
research. The publication plan gives us some explicit constraints for the file format selection
and the content to be included in the dataset. We recognize that the dataset should meet the
following requirements:

• it focuses on the temporal content,
• it is delivered in a simple and easy-to-use format,
• it is released with a permissive license,
• it contains a quickly moving camera, 3D scenery, and lights,
• its material model and geometry complexity are modern,
• it uses skeletal animations.


Having a clear focus and using a single file to provide the dataset attracts users by keeping
the overhead of trying it low. We aim to improve upon the previous work in temporal qualities,
but the dataset should also have similar complexity in geometry and materials to previously
released sets, like the modern ORCA datasets [67]. ORCA datasets have relatively high
resolution in their PBR textures. Therefore, the dataset format should also support and use
PBR textures. We also plan to capture the animation from a game, so the file format should
also support skeletal animations. Next, we describe the metrics used to compare datasets.


3.2.2 Dataset Comparison Metrics 


We recognize the following errors and downsides in the currently used temporal reuse
algorithms: blurriness, ghosting and temporal lag, temporal instability,
and under-sampling artifacts. We aim to create a benchmark dataset that introduces
situations where these problems are brought forward and where they can be easily examined.
In a typical interactive scenario, like in games, temporal coherence can be observed in
the majority of the screen [68]. Temporal coherence means that the pixels are precisely
the same as in the previous frame. To compare how temporally challenging our proposed
dataset is relative to previously released benchmarks, we measure the opposite of
temporal coherence: the percentage of pixels that should be discarded as invalid. We focus
on the same features that many of the temporal methods use as their inputs and compare how
they change from frame to frame. Namely, these features are the distance a pixel changes its
position in the world, the shading normal used in lighting calculations, and the direct and
indirect radiance the pixel emits towards the camera. Next, we will go through how these
metrics are formed.


The first metric we look at is the movement of the camera through the scene. The camera
moves in XYZ coordinates and rotates around its center point. We calculate the distance
$d_c$ the camera travels each frame:

\[ d_c(p_i, p_{i-1}) = \| \overrightarrow{p_{i-1} p_i} \|, \tag{3.1} \]

where $p_i$ is the camera's position in the current frame and $p_{i-1}$ its position in the
previous frame. To compare camera orientation, we calculate the difference between the
current frame's and the previous frame's unit quaternions with Eq. (2.4). In addition, we
determine the per-axis rotation using

\[ d_r(\delta_i, \delta_{i-1}) = \| \delta_i - \delta_{i-1} \|, \tag{3.2} \]

where $\delta_i$ is the angle of the pitch, yaw, or roll rotation in the current frame, and
$\delta_{i-1}$ the angle in the previous frame.
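
The two camera metrics translate directly into code. The following Python sketch is illustrative only: the function names and the example camera path are hypothetical, and the per-axis difference follows Eq. (3.2) literally, so it does not account for angles that wrap around at 360 degrees.

```python
import math

def camera_distance(p_cur, p_prev):
    """Eq. (3.1): Euclidean distance the camera travelled between two frames."""
    return math.sqrt(sum((c - p) ** 2 for c, p in zip(p_cur, p_prev)))

def axis_rotation_change(angle_cur, angle_prev):
    """Eq. (3.2): absolute change of a single pitch, yaw, or roll angle."""
    return abs(angle_cur - angle_prev)

# Hypothetical camera path with one position per frame.
positions = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
per_frame = [camera_distance(cur, prev) for prev, cur in zip(positions, positions[1:])]
yaw_change = axis_rotation_change(10.0, 25.0)
```

Collecting `per_frame` over a whole animation gives a distribution of camera speeds that can be compared between datasets.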


To retrieve the previous frame's history buffers, we have to reproject the current pixel
to the previous frame. We can do this simply by projecting the current frame's world
space position with the clip matrix of the previous frame's camera and converting the result
to the previous frame's screen coordinates using the camera transforms shown earlier in
Eq. (2.3). Given the screen space coordinates $x, y$, we can discard those pixels that reside
outside of the image's edges. We determine whether the pixels are outside of the frustum with
a discard function $f_{frustum}(x, y, w, h)$:


\[
f_{frustum}(x, y, w, h) =
\begin{cases}
1, & \text{if } (x < 0) \lor (w - 1 < x), \\
1, & \text{if } (y < 0) \lor (h - 1 < y), \\
0, & \text{else,}
\end{cases} \tag{3.3}
\]


where $x, y$ are the reprojected screen space coordinates and $h, w$ are the height and
width of the screen. We apply it to get the discarded percentage $f_{percentage}(w, h)$ per
frame:

\[ f_{percentage}(w, h) = \frac{1}{wh} \sum_{i=1}^{w} \sum_{j=1}^{h} f_{frustum}(x_i, y_j, w, h), \tag{3.4} \]

where $x_i, y_j$ are the reprojected coordinates retrieved with indices $i, j$ running through
the image's width $w$ and height $h$. Finally, we calculate the mean of the discarded
pixels over the length of the animation with

\[ \frac{1}{N} \sum_{i=1}^{N} f_{percentage_i}(w, h), \tag{3.5} \]

where $N$ is the number of frames in the animation.
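
The frustum discard and its averaging can be sketched in Python as follows. The function names and the flat list of reprojected coordinates are assumptions, and the result is returned as a fraction between 0 and 1, which can be multiplied by 100 to obtain a percentage.

```python
def frustum_discard(x, y, width, height):
    """Return 1 when the reprojected coordinate falls outside the image, else 0."""
    outside = (x < 0) or (width - 1 < x) or (y < 0) or (height - 1 < y)
    return 1 if outside else 0

def discard_fraction(reprojected, width, height):
    """Fraction of pixels in one frame whose history lies outside the previous frame."""
    discarded = sum(frustum_discard(x, y, width, height) for x, y in reprojected)
    return discarded / (width * height)

def mean_discard(per_frame_fractions):
    """Mean discard over all frames of the animation."""
    return sum(per_frame_fractions) / len(per_frame_fractions)

# A 2x2 image where two of the four reprojected pixels left the screen.
frame_a = [(0.0, 0.0), (1.0, 1.0), (-0.5, 0.0), (1.5, 1.0)]
# A second frame where the camera stood still.
frame_b = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
fractions = [discard_fraction(frame_a, 2, 2), discard_fraction(frame_b, 2, 2)]
average = mean_discard(fractions)
```

A fast camera turn pushes a large share of the reprojected coordinates off screen, so this number grows with the rotation speed of the camera.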


Next, we introduce a discard function for each of the feature buffers, where the current
frame's values are compared to the reprojected values. For each reprojected feature buffer,
we also use bilinear sampling for the retrieved reprojected pixels [69].


First, we discard by the appearing disocclusions and occlusions using the reprojected
pixel's world positions. These pixels have invalid history, as they appear from behind a moving
object, as was seen in Figure 3.2, or a moving object occludes previously static pixels. We
form a metric similar to the depth-based edge-detection estimator by [48]. Using the current
frame's world position vector $p_i$ and the previous $p_{i-1}$, we compare the distance of the
two and discard those that are too far apart:

\[
f_{worldPosition}(p_i, p_{i-1}, a_{worldPosition}) =
\begin{cases}
1, & \text{if } \| \overrightarrow{p_{i-1} p_i} \| > a_{worldPosition}, \\
0, & \text{else,}
\end{cases}
\]

where $a_{worldPosition}$ is a discard threshold value. This value is similar to the threshold
used in the transfer function of previous work, where the goal is to create a discard mask
over all the pixels whose positions are too far apart from each other. We have tuned this
threshold for each scene to mitigate the effect of the scenes being of different sizes. For
example, a small room should have tighter thresholds compared to a small city. Scenes can also
have been arbitrarily scaled to unreasonable units.
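
A thresholded distance test of this form is straightforward to express in code. The sketch below assumes that the current and the reprojected previous feature values are available as flat lists of three-component tuples, and the names and the example threshold are hypothetical. The same function applies to any vector-valued feature buffer that is compared by Euclidean distance.

```python
import math

def feature_discard(v_cur, v_prev, threshold):
    """Return 1 when the feature changed by more than the threshold, else 0."""
    distance = math.sqrt(sum((c - p) ** 2 for c, p in zip(v_cur, v_prev)))
    return 1 if distance > threshold else 0

def feature_discard_fraction(cur_buffer, prev_buffer, threshold):
    """Fraction of pixels in one frame that are discarded for this feature."""
    flags = [feature_discard(c, p, threshold) for c, p in zip(cur_buffer, prev_buffer)]
    return sum(flags) / len(flags)

# World positions of two pixels: one barely moved, one was disoccluded.
current = [(0.0, 0.0, 0.0), (0.0, 0.0, 5.0)]
previous = [(0.0, 0.0, 0.01), (0.0, 0.0, 0.0)]
fraction = feature_discard_fraction(current, previous, threshold=0.1)
```

Because the threshold is expressed in scene units, the same value can be too strict for a large scene and too loose for a small one, which is why it is tuned per scene.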


Threshold selection greatly affects the number of pixels being discarded. With the world
position and normal feature buffers, we use similar values for each of the scenes, and
with direct and indirect lighting, we tuned the value so that the effect of the noise in Monte
Carlo path tracing would be minimized. Tuning the threshold for direct and indirect
light is similar to exposure control in cameras, where scenery with direct sunlight and
indoor lighting require different settings when capturing photos. Each of the discard
functions can also be used with Eq. (3.4) to get the discard percentage of the screen,
and the mean of that can be taken with Eq. (3.5) to get an idea of the average discard
throughout the whole animation.


We know from the rendering equation that the final color is linearly affected by
the pixel's shading normal. Too big a change in the normal may result in a different
color, and we should discard such a pixel as a low confidence pixel. The discard procedure
with normals is similar to that with disocclusions: we use the current and the previous frame's
shading normals $n_i$ and $n_{i-1}$ and their vector distance to construct a discard function
$f_{shadingNormal}$:

$$
f_{shadingNormal}(n_i, n_{i-1}, \alpha_{shadingNormal}) =
\begin{cases}
1, & \text{if } \lVert \overrightarrow{n_{i-1}n_i} \rVert > \alpha_{shadingNormal}, \\
0, & \text{else},
\end{cases}
$$


where $\alpha_{shadingNormal}$ is again a scene-dependent threshold value.


The incoming light of a pixel at the first light bounce in the rendering equation can be
thought of as direct light. The emissive surfaces $L_e$ of the first bounce
often contribute most to the lighting, and if a powerful light, for example the sun, is visible
from the pixel, its power has the most significant effect on the final color. We are interested,
for example, in cases where the lights are no longer visible from the pixel and the area should
be shadowed. To compare these direct light changes, we form a direct light discard function $f_{dir}$:


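$$
f_{dir}(L_{dir_i}, L_{dir_{i-1}}, \alpha_{dir}) =
\begin{cases}
1, & \text{if } \lVert \overrightarrow{L_{dir_{i-1}}L_{dir_i}} \rVert > \alpha_{dir}, \\
0, & \text{else},
\end{cases}
$$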


where $L_{dir_i}$ is the pixel's direct radiance in this frame, $L_{dir_{i-1}}$ is that of the
previous frame, and $\alpha_{dir}$ is a scene-dependent threshold value.


Finally, we observe the indirect radiance the pixel receives, which results from light
reflected multiple times in the scenery before hitting a pixel. When the lights or the scenery
change, the light may bounce differently around the scene. To compare this change in indirect
radiance, we create a discard function $f_{indir}$ for the current frame's pixel:


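$$
f_{indir}(L_{indir_i}, L_{indir_{i-1}}, \alpha_{indir}) =
\begin{cases}
1, & \text{if } \lVert \overrightarrow{L_{indir_{i-1}}L_{indir_i}} \rVert > \alpha_{indir}, \\
0, & \text{else},
\end{cases}
$$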


where $L_{indir_i}$ and $L_{indir_{i-1}}$ are the indirect radiance values of the current and the
previous frame, and $\alpha_{indir}$ is a discard threshold value tuned for that scene.
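
The four discard functions of this section share one form: the distance between the current and the reprojected feature value is compared against a scene-dependent threshold. The following Python sketch illustrates that shared form. It is an assumption-laden illustration, not the implementation used in this work: features are taken to be tuples of floats (world position, normal, or RGB radiance), the distance is assumed to be Euclidean, and the names `discard` and `discard_mask` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def discard(current: Vector, previous: Vector, alpha: float) -> int:
    """Return 1 if the two feature values differ by more than alpha, else 0."""
    return 1 if math.dist(current, previous) > alpha else 0


def discard_mask(
    current_buffer: Sequence[Sequence[Vector]],
    reprojected_buffer: Sequence[Sequence[Vector]],
    alpha: float,
) -> List[List[int]]:
    """Apply the discard function to every pixel of two equally sized buffers."""
    return [
        [discard(c, p, alpha) for c, p in zip(cur_row, prev_row)]
        for cur_row, prev_row in zip(current_buffer, reprojected_buffer)
    ]
```

Only the feature buffer and the threshold change between the world position, shading normal, direct light, and indirect light variants.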
3.2.3 Animation Capturing Methods


Interactive virtual applications, like games, have many animations that provide temporal
movement. Capturing this movement would provide an accessible way to create benchmarking
datasets. However, to the best of our knowledge, no capturing methods have been published
that capture the animation as it happens in a 3D world and record it to a general animation
file. Still, similar methods, tools, and techniques exist that could be utilized in
animation capture.


Standard capturing tools are the frame debuggers, whose idea is to add a tracing and
recording layer between the application and the graphics API [70]. GPU manufacturers
and API providers each have their own tools for this: NVidia NSight, Windows PIX DirectX12,
Intel GPA framework, and Apple Frame Capture Debugging Tools [71] [72] [73] [74]. In addition
to these, there are the open source tools Renderdoc and Apitrace [75]. In practice, these tools
record every call, with its parameter values, from the rendering application to the API.
With such information, the debugger can examine step by step how the frame was created,
displaying every event that happened.


For capturing a longer animation, however, these tools are not very practical. They
support capturing only a small number of frames, and each capture halts the program for
a short moment. Moreover, applications often apply occlusion culling and level of detail
techniques before sending geometry to the GPU. Occlusion culling means skipping the
rendering of geometry that is not seen by the current camera [76]. With level of detail, the
closer the object is to the camera, the more detailed geometry is required [77]. The resulting
capture would therefore have varying levels of geometric detail, whereas the highest level
would be desired. Because of these techniques, only optimized geometry reaches the graphics
API, which makes it impossible to render the whole animation from the capture.


Capturing only the model transforms to an animation is supported by game engines
like Unity and Unreal Engine [79]. In Unity, the feature is called GameObjectRecorder.
It allows users to record frame-by-frame properties of game objects during
runtime [78]. In Unreal Engine 4, this feature is called Take Recorder [79]. These can
be used, for example, in motion capture or in pre-recording physics simulations to be used
later in cinematics [79].


Some deep learning research uses data similar to 3D rendering data, mostly in motion
flow research datasets. The motion flow datasets "Playing for Benchmarks" and "Playing for Data"
modified the GPU shaders of games and captured the required buffers [5] [6]. In the same manner,
a similar approach was taken with Minecraft, where camera and feature buffers were
captured [80].
4 RELATED WORK


In this chapter, we review previous work on rendering benchmark datasets. First,
we look at rendering performance benchmarks for processors and then at publicly available
datasets with animations that could be used in benchmarking the quality of temporal methods.
Finally, we introduce the Toasters and ORCA datasets, which we compare against our proposed
benchmarks.


4.1 Rendering Performance Benchmarks 


Benchmarking the performance of microprocessors has an established place in the ongoing
continuum of achievements in the semiconductor industry [81]. CPU and GPU benchmarks
test how quickly different processors solve predetermined tasks. Next, we survey
both commercial and open source benchmarks that have been created to benchmark
rendering capabilities.


Commercial benchmarks are designed to test different workloads to help consumers
and industries in their processor purchase choices. There are rendering benchmarks by
SPEC, UL benchmarks, the Unigine game engine benchmark, and Cinebench by Maxon
[85]. Rendering benchmarks take in scene descriptions with camera settings,
geometry, material, and animation descriptions and output a performance score. Some
benchmarks also output how long each rendering step took, as well as images or a video of the
rendering. For example, UL benchmarks have an extensive test suite, with stress
tests and feature tests for mobile, PC, and VR. The VR tests, for example, test whether
the computer is capable of supporting VR. They consist of different scenes and settings
and make standard game settings available for testing. They measure frames-per-second
performance and processor bottlenecks, and use a scoring system. Some tests exercise
command queues and run on compute and graphics pipelines. Different test sets enable
different graphics features, like ray traced reflections, tessellation, transparent
materials, or particle simulations.


A few open-source processor benchmarks have been released that test rendering
performance. PARSEC has benchmarks for graphics-related workloads, like
FACESIM, which tests rendering with skeletal animation-like bone weights, and RAYTRACE,
which tests a ray tracing workload [86]. Splash-3 is designed for benchmark workloads similar
to those of PARSEC, and GraalBench was released to test the performance of mobile
processors; both also have graphics workloads [88]. These differ from commercial
benchmarks in that they have the source code and necessary assets available.


4.2 Rendering Benchmarks 


For temporal methods, the camera and the scenery should change so that the renderer has
both reuse opportunities and challenges. In research, instead of using standard graphics
benchmarks, authors typically create a small, simple animation for the given problem and do
not release it to the public. We review previous work with temporal elements that could
fulfill this research rendering use case.


A Benchmark for Animated Ray Tracing (BART) was released in 2001 [89]. The benchmark
has three scenes, Kitchen, Robots, and Museum, which are described in the Animated
File Format, an extension of the Neutral File Format that adds animation properties [90]. The
test suite was released for benchmarking purposes to measure ray tracing performance
and has been used in dynamic ray tracing research [91]. Each scene is designed
with a specific stress goal in mind. The Kitchen scene stresses considerable differences in the
density of details, memory cache performance with the inclusion of hierarchical and
rigid-body animations, and varying frame-to-frame coherency in the animations. The Robots
scene focuses on hierarchical animation, the distribution of objects in the scene, and
bounding volume overlap. The final stress test is the Museum scene, which focuses on
the efficiency of rebuilding ray tracing acceleration structures. The authors also propose
methods to measure and compare error when the datasets are used with ray tracing algorithms:
a frame quality comparison with APSNR, the average of PSNR through
the animation, and a rendering performance comparison scheme in which the effect of the
computer's computational efficiency is minimized.
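
As an illustration of the APSNR measure mentioned above, the following Python sketch averages per-frame PSNR over an animation. It is a minimal sketch under stated assumptions and not BART's reference implementation: frames are flat sequences of pixel values in the range [0, `peak`], and the function names `psnr` and `apsnr` are hypothetical.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], test: Sequence[float], peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized frames."""
    if len(reference) != len(test) or not reference:
        raise ValueError("frames must be non-empty and of equal size")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical frames
    return 10.0 * math.log10(peak * peak / mse)


def apsnr(
    reference_frames: Sequence[Sequence[float]],
    test_frames: Sequence[Sequence[float]],
    peak: float = 1.0,
) -> float:
    """Average of the per-frame PSNR values through the animation."""
    if len(reference_frames) != len(test_frames) or not reference_frames:
        raise ValueError("animations must be non-empty and of equal length")
    values = [psnr(r, t, peak) for r, t in zip(reference_frames, test_frames)]
    return sum(values) / len(values)
```

Note that a single pair of identical frames makes the average infinite in this sketch, so a practical comparison may need to cap the per-frame value.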


Moana Island Scene is a complete animation description dataset of an island featured in
the 2016 Disney film Moana [92]. Disney provides a production-ready package for rendering
the shot in their proprietary renderer Hyperion, and an additional version of the scene for
the PBRT research renderer [21] [93]. It highlights some of the typical challenges in current
path tracing animation production. The dataset documentation mentions a challenging amount of
geometry and complex volumetric light transport. The dataset is massive compared to
other released datasets: the unpacked animation file size is over 131 GB, and there are
over 15 billion primitives in Moana's Motunui island. Releasing a scene from such a
highly treasured IP had its complications [1]. Nevertheless, it has seen some appreciation,
as there has already been research on how to manage and render the scene with
interactive ray tracing. Animations are described only for the procedural ocean
and a few slowly moving camera runs.


MPI Sintel Flow Dataset is a motion flow research dataset created from the open source 3D
animation short film Sintel [7] [95]. Sintel is a short fantasy animation that the Blender
Foundation produced as an open movie. The benchmark dataset contains feature buffers
similar to those used by temporal rendering algorithms. These rendered buffers
and the described camera intrinsic and extrinsic parameters are shared openly, but the
actual animation and geometry data files are behind the Blender Cloud paywall [96]. The
dataset does not come with 3D animation files, just the rendered frame images, so it
is not compatible with our dataset.


In addition to Sintel, the Blender Foundation provides many openly available datasets
in its demo files [97]. The purpose of the demo files is to display different Blender
rendering features. A few datasets could be utilized in animation rendering benchmarks,
like the animations designed for Blender's rasterizer Eevee: Wanderer, Temple, and
Ember Forest. There are also demos for physics simulations with animation, like the
Lava animation. All of the demo files are distributed in the blend file format and have varying
licenses. However, none of these scenes have substantially moving geometry or cameras, so
they were not included in the comparison.


KAIST Model Benchmarks have animated fracturing objects, cloth simulations, and walking
animated characters [98]. Similarly, the UNC Dynamic Scene Benchmarks have
animations of breaking objects and non-rigid object deformations [99]. The downside of
these datasets is the lack of temporally challenging scenarios. They have slowly
moving cameras or outdated material models, or they do not contain moving lights and objects.


4.2.1 The Utah 3D Animation Repository 


The Utah repository collection was created in 2001 by Ingo Wald [100]. Wald has previously
done a body of research on ray tracing dynamic scenes. He has worked on Intel's
Embree ray tracer and currently works at NVidia on RTX, which is a GPU with ray
tracing acceleration hardware [101]. The datasets were released along with two research
articles focusing on dynamic ray tracing [102]. His motivation behind setting
up a repository for the animations was to encourage more research on ray tracing for
interactive applications. He mentions the famous Stanford 3D scanning repository as an
inspiration [103]. We selected the Toasters dataset shown in Figure 4.1 to compare with
our animation.


The Toasters dataset is representative of most released benchmark datasets, such as
the UNC Dynamic Scene Benchmarks and KAIST Model Benchmarks [98]: it has
no camera setup, and its vertices morph around the scene. The dataset consists of key
frames as OBJ files, with materials described in MTL files. We added a camera that
points at the scene as in the images in the repository. For the comparison renders,
we convert the OBJ key frames to a deforming mesh animation with Blender shape
keys [104].
Figure 4.1. Rendering of the Toasters scene from the Utah Animation Repository [100].


Figure 4.2. Renderings of the three compared ORCA scenes. From left, the scenes are:
BISTROINTERIOR, BISTROEXTERIOR and EMERALDSQUARE [3] [67].


4.2.2 NVidia ORCA 


NVidia Open Research Content Archive (ORCA) is a library of professionally created 3D
assets donated to the research community. The donated sets include Amazon BISTROINTERIOR
and BISTROEXTERIOR for the Lumberyard game engine, NVidia EMERALDSQUARE
released along with Falcor, and ZERODAY by digital creator 'Beeple' [3] [67] [105] [106] [107].
The BISTRO datasets and EMERALDSQUARE can be observed in Figure 4.2. BISTROINTERIOR
and BISTROEXTERIOR were created to demonstrate new anti-aliasing and transparency
features of the Amazon Lumberyard engine. EMERALDSQUARE was similarly
published along with the release of the Falcor research rendering framework. All the files
are in the FBX file format [108], and they contain camera animations, modern textures, and
modern geometric complexity. The datasets run for 60 to 100 seconds, and their animated
cameras have 11 to 17 key frames described. These datasets are the current state of the
art animations to be used as rendering benchmarks for temporal rendering. The only
problem with the datasets is that the camera flies through the scenery very slowly and
does not represent the typical use context for the algorithms. None of these datasets were
released with benchmarking in mind, but they are great examples of the kind of
datasets often used by the temporal research community.


To compare the ORCA datasets in the same rendering context, we convert the files from the
original FBX format to the glTF 2.0 format using Blender. The camera animation is flipped
compared to how the Falcor renderer displays it, so we apply a flip to each of the key
frames so that the camera looks in the same direction as it does in Falcor. We also
had to scale the scenery so that one meter corresponds to around one
unit in Blender. We applied the suggested Falcor material models in the BISTROINTERIOR
dataset, namely the specific index of refraction values for glasses and liquids.
5 CAPTURING DATASET FROM CUBE 2: SAUERBRATEN


In this chapter, we describe the steps to create the dynamic benchmarking datasets.
We start by describing why glTF 2.0 was selected as the file format by comparing
it to other 3D formats, and then we present a high-level view of how we utilize glTF
to define the dynamic benchmark. After that, we describe in detail the
capturing workflow from Cube 2: Sauerbraten. We first introduce the context
with a quick overview of the high-level game loop. We then describe offline and
online capturing, and show the conversion from the captured intermediate format to the final
glTF file. Finally, we introduce the Unity VR setup for the TauEternalValleyVR dataset, how
the animation is played and how the recording is done there, followed by the intermediate
representation and conversion.


5.1 High Level Dataset Description in glTF 2.0


5.1.1 Comparison with Other 3D Animation Formats 


The glTF 2.0 file format was released in 2017 by the Khronos Group [109]. glTF is a file
format specifically created to be easily used with graphics applications. Applications can
use it right away because glTF allows describing 3D primitives in an OpenGL-ready binary
format. The glTF specification covers all the information required to represent a graphical 3D
scene. There are formats similar to glTF, with different tradeoffs. 3D file formats
have been in use for a long time, and a few have established themselves as standards. For
each format, we look at its main focus, what it is for, and why it was created. There
are as many file formats as there are 3D rendering engines, so we review only the ones that
would be appropriate alternatives to glTF 2.0. These include FilmBox (FBX), Collada (DAE),
and Universal Scene Description (USD).


The FilmBox (FBX) file format is an industry standard for 3D asset exchange between editors
and game engines [108]. To use the FBX file format, a proprietary SDK must be
used to handle the import and export of binary or ASCII FBX files [110]. It supports
all the essential 3D-defining elements: skins and meshes, textures, both rigid and
skeletal animation, material definitions, lights, and cameras, as well as embedding texture
images in the file. However, FBX does not support modern PBR material conventions, as it
uses Blinn-Phong to describe the light reflectance of materials. Its light description also
falls short of current PBR standards, as it does not use physically sensible units but
arbitrary light intensity values. The final and biggest downside is that it does not have an
official file specification. A few reverse engineering efforts have been fruitful for one of
the most commonly used file formats, but overall, being closed source makes it infeasible for
our use case [111].


Collada is another file format from the Khronos Group [112]. glTF 2.0 is an upgrade
and improvement over the Collada file format, so we prefer the newer one. Collada
supports all the 3D-defining elements, but due to ambiguous specifications, it has not been
widely adopted in 3D tools [112].


Universal Scene Description (USD) is a 3D file format from Pixar [113]. In addition to
the basic description of 3D properties, like meshes, lights, cameras, and textures, the
USD format supports multiuser scenarios and has layers that can be stacked on top of
each other. USD has been created for mainly two reasons. First, it provides a universal
language for assembling, packing, and editing 3D data across different 3D authoring tools.
Second, USD focuses on maximizing collaboration on the same assets and scenes.
For our use case, one unavoidable downside is the lack of rigging support, which is
left out of the specification to keep the file format reasonably straightforward for the
previously mentioned targets. Because of that, and because the focus on artistic iteration
adds complexity, we will not be using this file format.


There are also two research renderers, Mitsuba 2 and PBRT [114], which both have their own
formats for scene graphs and material definitions. These formats, however, are not
supported by other tools, even though the renderers are used extensively in graphics research
and have been created mainly for it. Moreover, they lack rigging support, which is a critical
feature for our use case.


The open and recent glTF 2.0 is the best selection for us. It provides all the required
features, does not have a big overhead, and is an open standard format that is easy
for any graphics program to support and is already widely supported [22] [109]. Next, we
take a look at what features we need to represent the dataset.


5.1.2 Used glTF 2.0 Features and Extensions


A diagram of the hierarchy used to save the captured dataset is shown in Figure 5.1.
Starting from the top, we have a glTF SCENE at the root. We make the SCENE node point
to the first node, which we call WORLDORIGO. As its name suggests, it is located at the
origin of the scene, at coordinates (0, 0, 0), and all the other nodes are placed as its
children in the hierarchy. Having everything under only one node allows the users of the
dataset to easily rotate and translate the scene if such a need appears, and when, for
example, the dataset is imported to a renderer for further inspection or editing, all of its
structures are under one node.


Figure 5.1. High-level diagram of how the glTF 2.0 file format is used to compose the dataset.
Static and dynamic objects are divided into their own branches in the tree hierarchy.


We place two nodes under the WORLDORIGO node: STATICOBJECTS and DYNAMICOBJECTS.
As the names indicate, the STATICOBJECTS node holds all the static geometry of the scene,
and the DYNAMICOBJECTS node holds all the objects that have animation channels described
for them. With this division, the users of the dataset can quickly start without any
animations if they so desire and then continue with rigid and skeletal animations. This
division also helps with simulating an interactive ray tracing scenario, where acceleration
structures must be updated due to animations [116]. For example, all of the scenery under the
STATICOBJECTS node can be placed into a static ray tracing acceleration structure by
default, and the dynamic objects into a rapidly updated one. We place the static models, the
map, and the lights under the static branch, and the dynamic camera, skeletal models, and
animated rigid-body models under the DYNAMICOBJECTS branch.
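
The node hierarchy described above can be sketched as a minimal glTF 2.0 JSON skeleton, built here with Python. This is an illustrative sketch only: the node name spellings `WorldOrigo`, `StaticObjects`, and `DynamicObjects`, the child names, and the helper `build_gltf_skeleton` are assumptions, and meshes, animations, and buffers are left out.

```python
import json
from typing import Dict, List, Sequence


def build_gltf_skeleton(static_names: Sequence[str], dynamic_names: Sequence[str]) -> Dict:
    """Build a glTF 2.0 document with one root node and two branches."""
    nodes: List[Dict] = [
        # Node 0: the root at the origin; nodes 1 and 2 are its branches.
        {"name": "WorldOrigo", "translation": [0.0, 0.0, 0.0], "children": [1, 2]},
        {"name": "StaticObjects"},
        {"name": "DynamicObjects"},
    ]
    for branch_index, names in ((1, static_names), (2, dynamic_names)):
        for name in names:
            nodes.append({"name": name})
            # glTF children are indices into the top-level "nodes" array.
            nodes[branch_index].setdefault("children", []).append(len(nodes) - 1)
    return {
        "asset": {"version": "2.0"},
        "scene": 0,
        "scenes": [{"nodes": [0]}],  # the scene points only to the root node
        "nodes": nodes,
    }


if __name__ == "__main__":
    print(json.dumps(build_gltf_skeleton(["Map"], ["Camera", "Player"]), indent=2))
```

Because the scene references only the root node, rotating or translating that single node moves the whole dataset, and a renderer can choose its acceleration structure strategy per branch.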


Figure 5.2. Cube 2: Sauerbraten is an open source, fast-paced multiplayer arena shooter
game. The screenshots captured from the game display different geometry and lighting
conditions in the game.


We use two glTF 2.0 extensions in the dataset: punctual lights and image-based lighting.
Extensions are declared in the "extensionsUsed" and "extensionsRequired" listings on the
top level of the glTF. The punctual lights are an extension for the glTF nodes [117].
Each node can point to a light, so a light is animated through its underlying node like any
other entity in the dataset. A node with the light extension must also point to the
corresponding light specification in the lights listing of the glTF [117]. Lights can be one
of three types: directional, point, or spot, of which only point lights are used in the
dataset. In the extension, we specify the color and the intensity of each light.
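
The following Python sketch shows how a point light could be attached to a node. It assumes that the punctual lights extension cited above is the Khronos `KHR_lights_punctual` extension, in which lights are listed at the top level of the glTF and referenced from nodes by index. The helper name `add_point_light` and the example values are hypothetical.

```python
from typing import Dict, Sequence

EXTENSION = "KHR_lights_punctual"  # assumed to be the extension cited in the text


def add_point_light(gltf: Dict, name: str, color: Sequence[float], intensity: float) -> int:
    """Add a point light and a node referencing it; return the new node index."""
    lights = (
        gltf.setdefault("extensions", {})
        .setdefault(EXTENSION, {"lights": []})["lights"]
    )
    lights.append(
        {"type": "point", "name": name, "color": list(color), "intensity": intensity}
    )
    light_index = len(lights) - 1

    # Declare the extension once in the top-level listing.
    used = gltf.setdefault("extensionsUsed", [])
    if EXTENSION not in used:
        used.append(EXTENSION)

    # The node carries the transform (and possible animation) of the light.
    nodes = gltf.setdefault("nodes", [])
    nodes.append({"name": name, "extensions": {EXTENSION: {"light": light_index}}})
    return len(nodes) - 1
```

Keeping the light definition separate from the node means that the same animation channels used for other entities can also move the light.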


The other extension we use is image-based lighting [118]. In the extension, we represent
an environment map that contains environmental lighting for the scene. The environment
map is an approximation of the sky for a virtual world, defined in a texture,
and it contributes to lighting the scene [8]. Environment maps are often provided as
high dynamic range images, which also encode the directional light coming from the
sun [8]. We use an environment map with a permissive CC-0 license called "Kloofendal
48d Partly Cloudy" [119] [120]. The environment has outdoor scenery with a clear sky and
bright sun.


5.2 Sauerbraten Rendering Loop 


Temporal challenge is an intrinsic aspect of games, so to produce an animation
dataset, we searched for a game with rapid movement of both the camera and
the virtual world. In addition, the game must be open source so that we can make edits and
apply our capturing methods. Therefore, we selected Cube 2: Sauerbraten [121],
a fast-paced arena first-person shooter game. Other valid games we considered were the
racing game SuperTuxKart, the real-time strategy game 0 A.D., the first-person sandbox game
Minetest, and the slow-paced stealth game The Dark Mod [122] [123] [124] [125]. We selected
Sauerbraten because it had the most significant potential for quickly
moving 3D objects and cameras.


Cube 2: Sauerbraten, depicted in Figure 5.2, is a multiplayer arena shooter game, initially
developed by Wouter van Oortmerssen and Lee Salzman [126]. It diverged from
the original Cube engine's source code in 2002, with a focus on developing Cube's main
ideas further, namely geometrical representation and simplifying Cube's internal
structures [127]. The work continues with open source contributions, and the latest version
was released in December 2020 [121]. The engine follows in the footsteps of id
Software's game engines Quake and Doom from the early 2000s, with a particular focus on its
rendering and world creation [129]. As for its mesh and material formats, it uses Doom 3's
MD5 file format [130] [131].


Figure 5.3. Screen capture of the map Eternal Valley by 'Z3Ronic', created with the help
of 'Hewho', 'Cooper' and 'Eihrul' for the game Cube 2: Sauerbraten [132].


We selected a Sauerbraten map called Eternal Valley, made by a user named 'Z3Ronic',
for the dataset [132]. The map is a sizable outdoor scene, with the sky casting direct
sunlight into the valley and illuminating half of it, which can be seen in Figure 5.3. The scene
has been released under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0
Unported License (CC BY-NC-SA 3.0) [120]. This license allows us to edit the map and
update its textures, and also permits us to publish the created dataset if we credit the
creator and share it under a similar license. The models and textures in the game are each
licensed per item. Most of them have very similar licenses, varying
from Creative Commons Deed / Attribution Non-commercial Share-Alike (AT-NC-SA
2.5) to Creative Commons Attribution Non-Commercial (CC-BY-NC). Unfortunately, a few
items in the game did not specify any license, and a few had restrictive licenses, like no
derivatives (CC-BY-ND) [120]. We removed such items, as they would not
allow us to release this dataset as a graphics rendering benchmark.


To understand how and where to capture, we first present a simplified high-level game
loop diagram of Sauerbraten in Figure 5.4. A game match in Sauerbraten starts by loading
the necessary resources for the selected game mode and map. During loading, all the
map entities and the player characters are unpacked and loaded into memory. In
this loading step, the 3D models and animations are unpacked from OBJ, MD3, and
MD5 files to Sauerbraten's internal representation of models and animations.


Figure 5.4. High-level diagram of Sauerbraten's game loop.


Figure 5.5. Flow diagram of how a frame is rendered in Cube 2: Sauerbraten.


When all of the necessary data is in memory, the game loop can start. The loop is iterated
many times per second, each time rendering a frame of the game to the screen, which we
describe shortly with the rendering loop. Each game loop iteration follows the structure
game engines commonly have: first, the inputs are handled, then physics are updated
and simulated, then the game's artificial intelligence (AI) decides the next inputs for the
computer-controlled entities and executes them, and finally, a frame is rendered [25]. The
loop stops when the game is shut down or exited.
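
The loop structure described above can be sketched schematically in Python. This is a generic illustration and not Sauerbraten's actual code: the four stage callbacks, the `should_quit` predicate, and the function name `run_game_loop` are hypothetical placeholders.

```python
from typing import Callable


def run_game_loop(
    handle_input: Callable[[], None],
    update_physics: Callable[[], None],
    update_ai: Callable[[], None],
    render_frame: Callable[[], None],
    should_quit: Callable[[], bool],
) -> int:
    """Run the stages in a fixed order until the game exits; return frame count."""
    frames = 0
    while not should_quit():
        handle_input()    # 1. read player inputs
        update_physics()  # 2. update and simulate physics
        update_ai()       # 3. AI decides and executes its inputs
        render_frame()    # 4. render the frame
        frames += 1
    return frames
```

Because every stage runs once per iteration and in a fixed order, a capture hook placed in the render stage sees the entity state after all updates of that frame.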


Physics and AI updates contain essential parts for the capturing process. They update the
state of all the entities in the world for the frame and, if required, their model transforms
so that the rendering step can render them in the correct location [27]. These two steps
may also introduce new entities or remove existing ones during gameplay.


Sauerbraten's rendering is fairly sophisticated, so in this thesis, we focus only on the
parts related to the capturing process [27]. We present a simplified high-level diagram of
the execution flow that Sauerbraten uses to render a frame and display it on a screen
in Figure 5.5. The process in Sauerbraten is executed in a single thread, making the
capturing process sequential.


Next, we advance through Figure 5.5 step by step and present the changes made
to the rendering to enable successful capturing. First, the rendering starts by updating
the internal time to obtain the delta from the previous timestamp. Here, we add the capture
of a correctly timestamped key frame, which will later form the animations.
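
A timestamped key frame capture of this kind can be sketched as follows. This Python sketch is an assumption-based illustration, not the code added to Sauerbraten: it assumes a millisecond clock, stores only a translation per entity, and uses the hypothetical class name `KeyframeRecorder`.

```python
from typing import Dict, List, Sequence, Tuple

Keyframe = Tuple[float, Tuple[float, ...]]  # (time in seconds, translation)


class KeyframeRecorder:
    """Collects per-entity key frames stamped with the animation time."""

    def __init__(self, start_ms: int) -> None:
        self.start_ms = start_ms
        self.last_ms = start_ms
        self.time_s = 0.0
        self.tracks: Dict[str, List[Keyframe]] = {}

    def begin_frame(self, now_ms: int) -> int:
        """Update the internal time and return the delta to the previous frame."""
        delta_ms = now_ms - self.last_ms
        self.last_ms = now_ms
        self.time_s = (now_ms - self.start_ms) / 1000.0
        return delta_ms

    def capture(self, entity: str, translation: Sequence[float]) -> None:
        """Store a key frame for the entity at the current frame's timestamp."""
        self.tracks.setdefault(entity, []).append((self.time_s, tuple(translation)))
```

Stamping every capture of a frame with the same time value keeps the tracks of different entities aligned when they are later converted to animation channels.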


Next in the rendering loop, the dynamic lights are updated, and additional ones may be
added. Then the player's new camera projection and all other necessary view
matrices are calculated. The projection is updated to the camera's new pitch and yaw
values and the player's location in the scene. Then, at step 4 in Figure 5.5, the camera's
frustum is used in occlusion culling and scissoring of the objects not seen by the camera
[133]. We are interested in all of the geometry and animations in the scene, so
we disable the geometry culling.


We disable the skybox rendering that occurs at step 4 in Figure 5.5, and then, during the
rendering of static objects, skeletal characters, entities, and first-person models, we capture
the required model transforms. These are the key to our capturing process, which we
describe in the next section. In addition to the steps in the diagram, there are numerous
steps, like grass and water generation and rendering the heads-up display (HUD),
that we skip, as they are not of interest to our capturing procedure. We disable most of
these effects; this does not modify the captured outcome, it only lowers the time spent in
rendering.


5.3 Capture Workflow 


In Cube 2: Sauerbraten, there is entirely static data, like the map and its objects, and
dynamic data, like the player characters. The game loads the necessary data into random
access memory at the start of the game so as not to disrupt the player experience during
gameplay. To create a single file for the animation, we divide the work into two steps:
offline start-up captures and runtime captures. The workflow for capturing is depicted in
Figure 5.6.


5.3.1 Offline Start Up Captures 


We perform the offline capture just before the interactive game loop starts,
during the loading resources step shown in Figure 5.4. At this
point, Sauerbraten models and Sauerbraten animations are formed and unpacked from
MD5Mesh and MD5Anim files. We capture this internal Sauerbraten model, as presented
in Figure 5.6. Inspired by standard human-readable file formats, like OBJ [134],
we record the offline resources to an intermediate format described in Table 5.1.


Figure 5.6. Diagram of how MD5Mesh and MD5Anim files together are parsed to
Cube 2: Sauerbraten's internal representation. We do the offline capture here for each
used model. Instances of the models move around during runtime, and their poses change
accordingly. Final vertex skinning happens in the vertex shaders. We capture the pose
and the model dual quaternion just before that, after all the game related changes have
been applied to the animations.


For each of the captured properties, we define a parsing header and the format for the
data. The headers start with a unique property identifier, like p in Table 5.1, which
identifies that the following data in the string line holds a vertex position. The other
header key is the mesh identifier, meshID, which describes which of the model's meshes
the data belongs to. Character models commonly have multiple meshes in Sauerbraten [131].
However, the armatures expressed with bind and inverse bind poses are shared with all
the meshes in the model.


The data formats of the model properties are listed in Table 5.1. All of the properties
follow the standard practice used in rasterization that the data is connected to a single
vertex so that it can be efficiently processed with a vertex shader [8]. So the normal is
the surface's normal at the point of the vertex, and joints show which joints influence
this vertex.


In Sauerbraten's internal models, each vertex has at least one and at most four joints
influencing it. Therefore, not all of the elements may be present for each joint and
weight. We mark such optional elements with the symbol ? in Table 5.1.
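
To illustrate the optional joint and weight elements, the following is a minimal Python
sketch of how such lines could be parsed. The exact delimiters of ModelMesh.txt are not
specified here, so the sketch assumes whitespace between the header keys and the data and
a colon between the data elements. The function names and the sample line are
hypothetical and not part of the capturing framework.

```python
from typing import List, Tuple

MAX_INFLUENCES = 4  # each vertex is influenced by one to four joints


def parse_influences(line: str) -> Tuple[str, int, List[float]]:
    """Parse a joint ('j') or weight ('w') line into (key, mesh_id, values).

    Assumed (hypothetical) layout: "<key> <meshID> <v0>[:<v1>[:<v2>[:<v3>]]]".
    """
    key, mesh_id, data = line.split()
    if key not in ("j", "w"):
        raise ValueError(f"not a joint or weight line: {line!r}")
    values = [float(v) for v in data.split(":")]
    if not 1 <= len(values) <= MAX_INFLUENCES:
        raise ValueError(f"expected 1 to 4 elements, got {len(values)}")
    return key, int(mesh_id), values


def pad_influences(values: List[float]) -> List[float]:
    """Pad the missing elements with zeros so every vertex has four slots."""
    return values + [0.0] * (MAX_INFLUENCES - len(values))
```

Padding with zeros is one possible design choice: a zero weight contributes nothing to
the blended pose, so a vertex with fewer than four influences can still be processed
with fixed-size vertex attributes.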


The model's armature is described with a resting bind pose b and an inverse bind pose x.
We can retrieve the bone hierarchy, as we add an identifier of the parent bone with
jointID. In addition, there is a parent index for each of the joints, which can be used to
link the joints in a hierarchy. In the bind pose, the dual quaternion of each joint holds
the resting pose in local space. The inverse bind pose is the opposite joint transform,
moving joints from the animation-space pose to the local joint space.


Table 5.1. Captured properties of a single model to ModelMesh.txt files during the offline
capturing procedure.

model property        parsing header   data format
vertex position       p                x : y : z
                      mesh ID
normal                n                x : y : z
                      mesh ID
triangle index        |                i : j : k
                      mesh ID
texture coordinates   t                u : v
                      mesh ID
joint                 j                x : y? : z? : w?
                      mesh ID
weight                w                x : y? : z? : w?
                      mesh ID
texture map           a                texture name
                      mesh ID
texture mask map      g                texture name
                      mesh ID
map                   u                x : y : z
                      map name
camera settings       c                aspect ratio : yFov : zNear : zFar
bind pose             b                joint ID
                                       joint name
                                       parent index
                                       rx : ry : rz : rw ; dx : dy : dz : dw
                                       per joint
inverse bind pose     x                joint ID
                                       joint name
                                       parent index
                                       rx : ry : rz : rw ; dx : dy : dz : dw
                                       per joint


Table 5.2. Captured properties saved to Capture.txt of the game state during runtime.

captured property   parsing header   data format
frame time          f                frame number
                                     frame time
static map model    p                rx : ry : rz : rw
                    capturable ID    dx : dy : dz : dw
                    model path
camera transform    t                x : y : z
                    capturable ID    pitch : yaw : roll
light               |                x : y : z
                    capturable ID    R : G : B
                    light type
                    is dynamic?
                    model type
model transform     m                rx : ry : rz : rw
                    capturable ID    dx : dy : dz : dw
                    model path
animated joints     |                rx : ry : rz : rw
                    capturable ID    dx : dy : dz : dw
                    model path       per joint


The selected map is recorded only by its name and the position offset its origin has from 
Sauerbraten world's origin coordinates. The name can be later used to retrieve the model 
for the map. 


The final captured properties are the textures. Each model has corresponding texture
maps and texture mask maps. The texture map has the base color, and the mask map has the
normal map and specular glossiness information. We capture only the name of the used
image file, which is then used to retrieve the image later.


5.3.2 Runtime Captures 


The physics update and the AI update depicted in the game loop figure update each frame's
state. This state holds transforms, blended skeletal animations, and other entity
properties. We use a capturing strategy similar to the offline capturing; the captured
runtime properties are shown in Table 5.2.


Each property is identified with a parsing header in the final Capture.txt. We call the
recorded entity instances Capturables and add a unique identifier, capturableID, for each
of them. Instances appear and disappear during the rendering process, so we generate
a new running identifier for them on the fly. We bind the information declared in the data
format as an animation key frame for the given Capturable instance.
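
The running identifier can be sketched as follows. This is a minimal Python illustration
under our own assumptions: how Sauerbraten keys its entity instances is not described
here, so the instance_key argument and the class name CapturableRegistry are
hypothetical.

```python
import itertools
from typing import Dict, Hashable


class CapturableRegistry:
    """Hands out running identifiers to entity instances as they appear."""

    def __init__(self) -> None:
        self._next_id = itertools.count()
        self._ids: Dict[Hashable, int] = {}

    def id_for(self, instance_key: Hashable) -> int:
        """Return the existing identifier, or allocate the next one."""
        if instance_key not in self._ids:
            self._ids[instance_key] = next(self._next_id)
        return self._ids[instance_key]

    def release(self, instance_key: Hashable) -> None:
        """Forget an instance that disappeared; its identifier is not reused."""
        self._ids.pop(instance_key, None)
```

Never reusing an identifier keeps the key frames of a disappeared instance separate
from those of a later instance that happens to take its place.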


At the start of a new frame, at step 1 in Figure 5.5, we capture a new moment in time with
a frame time. The captured moment is the time that will be applied as a key frame time to
all of the new animation transitions that happen during the rendering of this frame.
Transitions end up in the intermediate file in chronological order, which simplifies the
parsing process. However, recording key frame times means that the capturing is dependent
on the game's rendering frequency. To ensure the recorded animations stay stable, we set a
fixed rendering rate of 60 frames per second.
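
With a fixed rendering rate, the key frame time follows directly from the frame number.
The Python sketch below shows this relation. It assumes that frames are counted from
zero and that the frame line lists the key f, the frame number, and the frame time
separated by spaces; both the layout and the function names are illustrative
assumptions.

```python
CAPTURE_RATE_HZ = 60  # fixed rendering rate used during the capture


def key_frame_time(frame_number: int, rate_hz: int = CAPTURE_RATE_HZ) -> float:
    """Time in seconds of a frame rendered at a fixed rate, counting from frame 0."""
    return frame_number / rate_hz


def frame_header(frame_number: int) -> str:
    """Format a hypothetical 'f' line: key, frame number, then frame time."""
    return f"f {frame_number} {key_frame_time(frame_number):.6f}"
```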


Cube 2: Sauerbraten entities are divided into five different property types: static
models, camera, lights, dynamic models, and skeletal models. Mesh-related properties have
model paths declared in their header. The path can later be used to point to the correct
offline-recorded ModelMesh.txt when converting to the final animation format.


Before most of the properties can be captured, Sauerbraten must first update the model
transforms or, with skeletal armatures, interpolate and blend between animations, as shown
in Figure 5.6. The transforms are represented as translations and rotations or as a dual
quaternion.


Moving player characters are described with both a model transform and a joint hierarchy,
and they share a capturableID. The model transform places the model origin in the world
coordinates, and the armature transforms the joints in model space. Sauerbraten uses
a standard skinning procedure in the vertex shader, like the one shown in Appendix A.1, so
it must prepare the animation joint matrix stack for each of the skeletons in the scene
[27]. We capture the joints' dual quaternions after they have been computed to their
animated pose in the model space, but just before they are converted to a local joint
space with their parent joint's inverse bind pose [8][25]. With this capturing strategy,
we reduce the steps in the conversion workflow described later.


The final Sauerbraten-specific Capturables are the lights in the scene. They have a color,
a light type, a model type, and they can be either static or dynamic. We record the color
with the RGB values the lights have internally in Sauerbraten. The light type declares
whether the light is a point light, a directional light, or a spotlight. However, we only
use point lights, and we filter out the other types during the capture parsing process.
The dynamic or static flag tells us whether or not the light will have model transforms.
The final item in the light's parsing header is the model type. The type is a
Sauerbraten-specific identifier that tells us whether the light entity is a flash, a
bouncer, or a rocket [131]. We use different conversion methods for each of these types to
make them look very similar to how they appear in the game, even when the dataset is path
traced.


Figure 5.7. High-level class diagram of parsing the capture data to Capturables and
ModelMeshes.


5.4 Conversion Workflow 


Next in the pipeline is to start building the glTF file. With the help of the parsing
headers shown in Tables 5.1 and 5.2, we parse the intermediate capture files to the
classes depicted in Figure 5.7. These classes are Capturable, Frame, ModelMesh, and
CompiledMesh.


Capturables contain the recorded entity instance information for each recorded model.
Capturables may have Frames, which describe animation properties on a per key frame basis.
If a Capturable does not have any Frames, it is marked as static. A Capturable always has
a ModelMesh connected to it. ModelMeshes are similar to the models in Cube 2: Sauerbraten,
each describing a single skeletal or rigid-body model [137]. Each ModelMesh has one or
more CompiledMeshes that describe the model's 3D primitives.


Each compiled mesh can be put to the glTF file. They are placed in Meshes in the glTF
specification, with all the primitives described as corresponding buffers. For example,
the vertices from the compiled mesh can be placed under the Position attribute of a glTF
Mesh component, which points to an OpenGL buffer [109]. All the other primitives have a
similar place in the glTF mesh. The meshes need to be defined only once, and then multiple
Capturables may point to the same mesh and its primitives.
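
The sharing of one mesh between several Capturables can be sketched with a small glTF 2.0
fragment built in Python. The property names (meshes, primitives, attributes, POSITION,
indices, nodes, mesh, translation) come from the glTF 2.0 specification, whereas the
accessor indices are placeholders and the functions build_scene and to_json are
hypothetical helpers. The fragment omits the accessors, buffer views, and buffers, so
it is not a complete glTF file.

```python
import json
from typing import Dict, List


def build_scene(translations: List[List[float]]) -> Dict:
    """Build a glTF 2.0 fragment in which every node references mesh 0."""
    mesh = {
        "primitives": [
            {
                # Accessor indices are placeholders for buffers built elsewhere.
                "attributes": {"POSITION": 0, "NORMAL": 1, "TEXCOORD_0": 2},
                "indices": 3,
            }
        ]
    }
    # One node per Capturable instance; the mesh itself is stored only once.
    nodes = [{"mesh": 0, "translation": t} for t in translations]
    return {
        "asset": {"version": "2.0"},
        "meshes": [mesh],
        "nodes": nodes,
        "scenes": [{"nodes": list(range(len(nodes)))}],
        "scene": 0,
    }


def to_json(gltf: Dict) -> str:
    """Serialize the fragment as the JSON part of a glTF file."""
    return json.dumps(gltf, indent=2)
```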


The model transforms are used to define the placements of these Capturables in the scene.
We use the captured Frames to create animation channels for the translation, the rotation,
and the scaling that happen during the animation [109]. Recorded key frames are applied
as-is to the animation. When the object is static, only the known placement is used
as the node's transform matrix.


Figure 5.8. The diagram shows how the ModelMesh, CompiledMesh, and Capturable contribute
to different parts of the final glTF dataset. glTF has the skeletal data in a slightly
different format than that of the MD5Mesh and MD5Anim or Sauerbraten's internal model, so
they must be rotated and converted correctly.


Because objects appear and disappear in the captured animations, we use scaling to
determine when a Capturable should be shown in the animation and when not. We set the
scale of the object to 0 for those key frames in which the Capturable should be invisible,
and when it reappears, we restore its correct scaling values.
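
The visibility handling can be sketched as the generation of scale key frames. The
Python sketch below is an illustration under our own assumptions: the function
scale_keys is hypothetical, and it emits a key only when the value changes, which
differs from applying every recorded key frame and would require STEP interpolation in
the glTF animation sampler so that the scale does not blend between keys.

```python
from typing import List, Sequence, Set, Tuple

Vec3 = Tuple[float, float, float]


def scale_keys(
    frame_times: Sequence[float],
    visible_frames: Set[int],
    scale: Vec3 = (1.0, 1.0, 1.0),
) -> List[Tuple[float, Vec3]]:
    """Return (time, scale) key frames; a zero scale hides the Capturable."""
    hidden: Vec3 = (0.0, 0.0, 0.0)
    keys: List[Tuple[float, Vec3]] = []
    for frame, time in enumerate(frame_times):
        value = scale if frame in visible_frames else hidden
        # Only emit a key when the value changes, keeping the channel small.
        if not keys or keys[-1][1] != value:
            keys.append((time, value))
    return keys
```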


We convert the dual quaternions of the recorded skeletal armatures to translations and
rotations and decompose the skeletal parent-child hierarchy. In the glTF specification,
the bind pose must be created with glTF nodes, placing them in a similar parent-child
hierarchy. We place the joint orientations as the resting bind pose from the ModelMesh, as
depicted in Figure 5.8. The animations are applied in the same way as with the model
matrices, but now we rotate and translate the individual bones [109].
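
The conversion from a dual quaternion to a rotation and a translation can be sketched
as follows. The Python sketch assumes unit dual quaternions stored in the
(x, y, z, w) order used in Table 5.1 and the common convention in which the dual part
equals half of the translation multiplied by the rotation. The function names are
hypothetical, and the convention should be checked against the capturing code before
use.

```python
from typing import Tuple

Quat = Tuple[float, float, float, float]  # (x, y, z, w)
Vec3 = Tuple[float, float, float]


def quat_mul(a: Quat, b: Quat) -> Quat:
    """Hamilton product a * b for quaternions stored as (x, y, z, w)."""
    ax, ay, az, aw = a
    bx, by, bz, bw = b
    return (
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
        aw * bw - ax * bx - ay * by - az * bz,
    )


def conjugate(q: Quat) -> Quat:
    """Conjugate of a quaternion, which is its inverse when q has unit length."""
    x, y, z, w = q
    return (-x, -y, -z, w)


def from_rotation_translation(rot: Quat, t: Vec3) -> Tuple[Quat, Quat]:
    """Build the (real, dual) parts with dual = 0.5 * t * rot (assumed convention)."""
    dx, dy, dz, dw = quat_mul((t[0], t[1], t[2], 0.0), rot)
    return rot, (0.5 * dx, 0.5 * dy, 0.5 * dz, 0.5 * dw)


def to_rotation_translation(real: Quat, dual: Quat) -> Tuple[Quat, Vec3]:
    """Recover the rotation and the translation t = 2 * dual * conjugate(real)."""
    tx, ty, tz, _ = quat_mul(dual, conjugate(real))
    return real, (2.0 * tx, 2.0 * ty, 2.0 * tz)
```

The resulting rotation quaternion and translation vector have the same form as the
rotation and translation properties of a glTF node, which is why this decomposition
fits the animation channels described above.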


The camera is simple to declare with glTF: we only need to provide the parameters required
for a pinhole projection matrix. This is the information we have captured as the camera
settings in Table 5.1: aspect ratio, yFov, zNear, and zFar. Its animation follows the same
procedure as any other animation in glTF, in which the underlying node is animated.
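
As a sketch, the camera settings row of Table 5.1 maps to a glTF 2.0 perspective camera
object as shown below in Python. The property names aspectRatio, yfov, znear, and zfar
are those of the glTF 2.0 specification, which expects the field of view in radians.
The sketch assumes that the captured yFov is stored in degrees, which is not stated in
the text, and the function name is hypothetical.

```python
import math
from typing import Dict


def gltf_perspective_camera(
    aspect_ratio: float, yfov_degrees: float, znear: float, zfar: float
) -> Dict:
    """Return a glTF 2.0 camera object for a pinhole (perspective) projection."""
    return {
        "type": "perspective",
        "perspective": {
            "aspectRatio": aspect_ratio,
            "yfov": math.radians(yfov_degrees),  # assumed degrees in the capture
            "znear": znear,
            "zfar": zfar,
        },
    }
```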


The final things are the lights, and as previously mentioned in Chapter 5.1.2, they are
simple extensions on top of glTF nodes. We utilize the earlier techniques to place the
node in the right spot in the world, and if the lights are dynamic, we add corresponding
translation and rotation animation channels for them. Furthermore, if the light appears
and disappears during the animation, we set its scale to zero, indicating that the light
should not be used in rendering on this frame. We recorded three types of lights during
the gameplay: bouncers, flashes, and rockets. For bouncers, we add six lights offset
from the recorded position to around the bouncer and weight the recorded color and
values so that the rendered image looks approximately the same as it does in the
game. Rockets do not need any edits, and flashes stay only for a short amount of time.
We amplify the intensity values of the flashes and let them stay a few frames longer to
more closely represent the effect that happens in the game.


5.5 Capturing Virtual Reality 


The first-person camera does not have roll rotation, as we showed in the camera capture
process, where we capture only pitch and yaw. For this reason, we also want to produce
a dataset that has a realistic VR setting in terms of temporal rendering. We load
a recorded glTF to Unity, set up simple VR movement with the Oculus 2 Unity integration
sample set, use the Oculus Quest 2 VR set to view a previously recorded gameplay match,
and finally capture a camera recording [136].


Unity is a 3D game engine by Unity Technologies [135]. Unity has been widely used
recently, for example, in the games Fall Guys and Valorant, and also in other industries,
like automotive by Audi and Toyota [137]. It also has excellent support for AR and VR. For
example, the games BeatSaber and PokemonGo have been created with it [137]. We set up
a simple new Unity scene and add the animation file to it with the glTF Unity tool package
UnityGLTF by the Khronos Group [138].


Oculus Quest 2 is a VR headset [136]. The headset runs Android application packages
without a high-end host PC GPU, and it tracks the user's head motion with its cameras. A
VR SDK provided by Oculus comes with an example program that suits our use case well. The
scene is configured to use both VR movement styles: walking and teleporting [136]. We load
our animation to the scene, add collisions to the static meshes, scale the world
correctly, and configure the walking speed. We also add looping to the animator. With this
setup, we can walk around and see the game match as an animation running as it was
recorded from the game.


Finally, we add a script to the created Unity scene that records each animation run to a
Capture.txt file, with only the camera matrices for each of the frames. With the
recordings, we use the conversion workflow described earlier to convert all the animations
to camera runs and combine them with the previous recording we loaded to Unity. Now we
have captured and created the ETERNALVALLEYVR glTF file, which represents well how the
camera moves when used with VR applications.




6 RESULTS 


In this chapter, we compare the produced datasets against the previously published work
discussed in Chapter 4.2. First, we present the details and properties of the animation
datasets. Then, we perform measurements on the camera's spatial movement during the
animation. Finally, we reproject each frame's pixels to the previous frame and measure
how much of the frame should be discarded as invalid history. We discard a pixel if there
are significant enough changes in the expected feature buffers. The buffers we focus on
for these comparisons are often used to guide temporal shading reuse algorithms and
post-processing effects, like temporal anti-aliasing. Namely, we compare the changes
in the virtual camera's view frustum and in the pixel's world position, shading normal,
and direct and indirect radiance.


6.1 Dataset Properties 


Properties of each dataset are presented in Table 6.1. The proposed datasets are the first
to use the file format glTF 2.0, while the others use the file formats FBX and Wavefront
OBJ, as discussed in Chapter 4.2. The face count gives some idea of how complex each of
the scenes is, the ORCA scenes BISTROINTERIOR and BISTROEXTERIOR being the most complex
and Utah TOASTERS the least. The Utah Animation repository is the only one compared here
that does not have an animated camera, and its geometry is animated with morph targets.


Table 6.1. Datasets' properties.

                     file format     file(s) size   faces       animated camera   morph targets
Eternal Valley FPS   glTF            1429 MiB       240 584     Yes               No
Eternal Valley VR    glTF            1564 MiB       292 475     Yes               No
Toasters             Wavefront OBJ   163 MiB        11 141      No                Yes
Bistro Interior      FBX             559 MiB        1 248 093   Yes               No
Bistro Exterior      FBX             1098 MiB       2 829 226   Yes               No
Emerald Square       FBX             2459 MiB       2 691 019   Yes               No


Next, the texture details for each dataset can be seen in Table 6.2. The material
definitions reveal how densely the geometry reacts to the lighting conditions and how
realistic the objects look when rendered. We note that Utah Animation has the least
complexity in its textures and uses Blinn-Phong shading with just simple diffuse
materials, whereas the rest of the datasets use more modern physically-based materials.
The outdoor scenes BISTROEXTERIOR and EMERALDSQUARE have a considerably higher count of
materials than the other sets, with EMERALDSQUARE having over 700 different textures used
in 220 materials. The ETERNALVALLEY datasets sit in the middle of the compared sets, with
89 materials that use 314 textures.


Table 6.2. Datasets' texture details.

                     material count   texture count   texture size   material workflow
Eternal Valley FPS   89               314             1406 MiB       PBR roughness metallic
Eternal Valley VR    89               314             1406 MiB       PBR roughness metallic
Toasters             7                6               0.62 MiB       Phong
Bistro Interior      74               212             512 MiB        PBR roughness metallic
Bistro Exterior      132              417             984 MiB        PBR roughness metallic
Emerald Square       220              701             2293 MiB       PBR roughness metallic


Figure 6.1. Rendered image of the ETERNALVALLEYVR dataset. The roll rotation is
apparent throughout the animation.


Finally, the details of the animation of each dataset are listed in Table 6.3. The
animation lengths vary from around 200 frames to over 2400; at 24 frames per second, this
ranges from 8 seconds to 100 seconds. The proposed datasets sit in the middle, with around
30 seconds of animation. The Utah animation TOASTERS has the animation run for the whole
duration, but being a morph target animation, it lacks object identifiers, making it
non-trivial to utilize acceleration structures in the recent ray tracing APIs [101].


Table 6.3. Datasets' animation details.

                     animation   static    dynamic         armatures   static point   dynamic point
                     frames      objects   rigid objects               lights         lights
Eternal Valley FPS   760         228       3100            34          46             1266
Eternal Valley VR    760         228       4333            32          46             1439
Toasters             248         -         -               -           -              -
Bistro Interior      1433        1414      0               0           4              0
Bistro Exterior      2404        1281      15              0           1              0
Emerald Square       1541        1030      0               0           2              0


Table 6.4. Change in camera's position.

                     mean    variance   max
Eternal Valley FPS   4.711   1.741      10.855
Eternal Valley VR    0.078   0.252      8.859
Toasters             0       0          0
Bistro Interior      0.013   0.000      0.018
Bistro Exterior      0.055   0.001      0.143
Emerald Square       0.154   0.004      0.248


As with the face count in Table 6.1, the object count adds to the information on how
complex the given scene is. There are remarkably fewer static objects in the proposed
datasets compared to the ORCA ones. The ETERNALVALLEY sets have more dynamic rigid-body
objects and dynamic lights than the compared datasets. The count of animated objects
continues the trend of the datasets being closer to an actual interactive scenario, where
the movement of the objects and lights is animated automatically by the game and not by
hand by an animator. Lastly, the newly proposed datasets are the only sets with animated
armatures.


6.2 Camera Movement 


In dynamic scenes, for the renderer to have a temporal challenge in the rendering
process, either the camera or the scene must change its transform. To validate the
amount of challenge the proposed datasets pose to a rendering client, we start by
taking a look at the camera's change in position and rotation during the animation.


We calculate the distance the camera moves each frame using Eq. (3.1). The calculated
mean, variance, and maximum value of the movement can be seen in Table 6.4.


Table 6.5. Change in camera's rotation.

                        mean    variance   max
Eternal Valley FPS      1.931   4.526      13.665
Eternal Valley VR       1.857   8.142      33.092
Toasters                0       0          0
Bistro Interior Wine    0.136   0.012      0.322
Bistro Exterior         0.134   0.013      0.472
Emerald Square          0.258   0.023      0.554


The amount of camera movement around the scene varies wildly between the datasets.
One noticeable aspect of the movement in the proposed datasets is the continuous small
changes in the motion compared to the other datasets. The continuous change shows as the
larger variance in the ETERNALVALLEY datasets. The other datasets keep a constant value
for a while, then jump abruptly to a new one. This sheds some light on the main difference
between the proposed datasets and the previous work: our datasets' animation key frames
have been recorded during the gameplay with high frequency, whereas the previous work has
the key frames placed by an animator, letting the camera fly linearly between the marked
points.
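
The positional statistics can be sketched in Python as follows. Eq. (3.1) is not
repeated here, so the sketch assumes that the per-frame movement is the Euclidean
distance between consecutive camera positions and that the variance is the population
variance; the function name position_change_stats is hypothetical.

```python
import math
import statistics
from typing import Dict, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def position_change_stats(positions: Sequence[Vec3]) -> Dict[str, float]:
    """Mean, variance, and maximum of the per-frame camera displacement."""
    # Distance between each pair of consecutive key frame positions.
    deltas = [math.dist(a, b) for a, b in zip(positions, positions[1:])]
    return {
        "mean": statistics.fmean(deltas),
        "variance": statistics.pvariance(deltas),
        "max": max(deltas),
    }
```

A camera that flies linearly between a few marked points produces nearly identical
deltas and thus a variance close to zero, whereas recorded gameplay produces deltas
that differ from frame to frame.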


Similarly to the position, we calculate the intrinsic geodesic distance between the
camera's two orientation quaternions for each frame in the animation using Eq. (2.4), take
their average, variance, and maximum value, and present them in Table 6.5.
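
A minimal Python sketch of such a rotational distance is given below. Eq. (2.4) is not
repeated here, so the sketch assumes the standard geodesic angle between two unit
quaternions, which is twice the arccosine of the absolute value of their dot product.
The function name is hypothetical.

```python
import math
from typing import Tuple

Quat = Tuple[float, float, float, float]  # (x, y, z, w), unit length


def geodesic_degrees(q0: Quat, q1: Quat) -> float:
    """Angle of the shortest rotation taking orientation q0 to q1, in degrees."""
    # abs() treats q and -q as the same orientation.
    dot = abs(sum(a * b for a, b in zip(q0, q1)))
    # Clamp against rounding so that acos stays within its domain.
    return math.degrees(2.0 * math.acos(min(1.0, dot)))
```

With this definition, the result always lies between 0 and 180 degrees, independent of
the scale of the scene.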


The same difference in variance can be noticed in the datasets' rotations in Table 6.5.
Furthermore, the maximum values are much larger in the proposed sets. The maximum value
can be considered the worst-case scenario: how significant a change there can be between
two frames. The proposed datasets' maximum deltas are about 20x their average changes,
which shows that there are some challenging situations for temporal algorithms, whereas
the max values of the compared previous works are only 3x higher than the average values.
Even though the mean of the positional change is only slightly higher than the movement in
BISTROINTERIOR and BISTROEXTERIOR, and smaller than in EMERALDSQUARE, the vanishingly low
variance in these scenes shows how predictable their movement is.


While the position is dependent on the scale of the world, giving only a little
information about the temporal challenge it poses to the renderer, the camera's rotation
is not. The camera can rotate a maximum of 180 degrees per frame, so rotation is a
better-suited metric when comparing the datasets' camera activity. Rotation reveals new
scenery previously not seen by the camera, so the bigger the change, the less can be
reused from the previous frame. The proposed datasets have spikes throughout the animation
that have a significantly larger angular change than the previous work. This is shown as
the variance in rotation in Table 6.5.


Figure 6.2. Renderings of the Sponza scene with the camera turned 25 degrees: on the left
with pitch rotation upwards, in the center with yaw rotation to the left, and on the right
with roll rotation. Marked in red are the frustum discard areas that do not have any valid
temporal history data. Quick movements, especially in the yaw direction, rapidly
invalidate all the available history, whereas roll rotation can invalidate only the
corners.


Table 6.6. Change in camera's rotation per axis (degrees / frame).

                     pitch mean   pitch variance   yaw mean   yaw variance   roll mean   roll variance
Eternal Valley FPS   1.088        2.450            3.454      17.468         0.004       0.000
Eternal Valley VR    0.939        1.298            3.365      33.086         0.557       0.405
Toasters             0            0                0          0              0           0
Bistro Interior      0.053        0.004            0.266      0.046          0.002       0.000
Bistro Exterior      0.014        0.000            0.265      0.051          0.004       0.000
Emerald Square       0.044        0.002            0.511      0.095          0.000       0.000


The type of the rotation also matters for the temporal rendering challenge: the same
amount of yaw rotation requires more of the frustum to be rendered anew than roll
rotation, as seen in Figure 6.2, but roll rotation might be more challenging to temporally
accumulate than simple horizontal or vertical rotation [52].


We calculate the change in rotation around each of the camera's three axes, pitch, yaw,
and roll, individually using Euler angles with Eq. (3.2), and compare their means and
variances in Table 6.6.
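
The per-axis comparison can be sketched in Python as shown below. Eq. (3.2) is not
repeated here, so the sketch assumes that the change is the absolute difference of
consecutive Euler angles in degrees, wrapped so that a step from 350 to 10 degrees
counts as 20 degrees, and that the variance is the population variance. The function
names are hypothetical.

```python
import statistics
from typing import Dict, Sequence, Tuple

Euler = Tuple[float, float, float]  # (pitch, yaw, roll) in degrees


def wrapped_delta(a: float, b: float) -> float:
    """Smallest absolute difference between two angles, in [0, 180] degrees."""
    return abs((b - a + 180.0) % 360.0 - 180.0)


def per_axis_stats(frames: Sequence[Euler]) -> Dict[str, Tuple[float, float]]:
    """Return {axis: (mean, variance)} of the per-frame change of each axis."""
    stats = {}
    for i, axis in enumerate(("pitch", "yaw", "roll")):
        deltas = [
            wrapped_delta(f0[i], f1[i]) for f0, f1 in zip(frames, frames[1:])
        ]
        stats[axis] = (statistics.fmean(deltas), statistics.pvariance(deltas))
    return stats
```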


The proposed datasets have a larger change in all angles during the animation than any
other set. Moreover, ETERNALVALLEYVR is the only dataset with considerable roll rotation,
as seen in the rendered image in Figure 6.1. This can be explained by the fact that it was
captured from a virtual reality setup, in which the user constantly sways their head
slightly during the recording. The most active compared animation, EMERALDSQUARE, has the
highest average yaw rotation of the previous work, but it is still 6x smaller than that of
the proposed dataset. Furthermore, the variance of that set is 649x smaller in pitch
rotation and 348x smaller in yaw rotation.
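
These ratios can be reproduced from the rounded values of Table 6.6 by comparing
EMERALDSQUARE against ETERNALVALLEYVR, as the short Python calculation below shows. The
variable names are only illustrative.

```python
# (pitch variance, yaw mean, yaw variance), copied from Table 6.6
ETERNAL_VALLEY_VR = (1.298, 3.365, 33.086)
EMERALD_SQUARE = (0.002, 0.511, 0.095)

labels = ("pitch variance", "yaw mean", "yaw variance")
ratios = {
    label: vr / es
    for label, vr, es in zip(labels, ETERNAL_VALLEY_VR, EMERALD_SQUARE)
}

for label, ratio in ratios.items():
    print(f"{label}: {ratio:.1f}x")
```

The yaw mean ratio comes out between 6x and 7x, and the variance ratios round to 649x
and 348x. Because the table values are rounded to three decimals, the ratios of the
small variances are only approximate.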


Time-series graphs of these rotational changes can be found in Appendix A.2.


6.3 Discard Percentage 


The final aspect of the datasets we are interested in is the discard percentages. To
accelerate the rendering of an image, the renderer can guide its efforts to low-confidence
areas. Temporal methods accumulate a pixel's history information but cannot do so when a
new frame's pixel reveals new geometry or when the history becomes invalid because the
lighting conditions of the pixel have changed. Our experimental discard method essentially
follows the idea behind temporal accumulation methods: using the information in the
previous frame's feature buffers to validate how usable the history information is. The
discarding process starts by reprojecting the pixel to the previous frame, then uses the
camera's rotational movement to discard the frustum edges that are fresh for the given
frame, and finally compares the feature buffers for each pixel.
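
The structure of such a discard test can be sketched in Python as follows. This is only
an illustration under our own assumptions and not the discard functions used for the
results: the features are assumed to be already reprojected, a pixel that falls outside
the previous frame is represented with None, the comparison is a Euclidean distance
against a hypothetical threshold, and all function names are hypothetical.

```python
import math
from typing import Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def is_discarded(curr: Vec3, prev: Optional[Vec3], threshold: float) -> bool:
    """A pixel loses its history if it reprojects outside the previous frame
    (prev is None) or if the feature changed by more than the threshold."""
    return prev is None or math.dist(curr, prev) > threshold


def discard_percentage(
    curr: Sequence[Vec3], prev: Sequence[Optional[Vec3]], threshold: float
) -> float:
    """Percentage of the pixels in one frame whose history is discarded."""
    discarded = sum(is_discarded(c, p, threshold) for c, p in zip(curr, prev))
    return 100.0 * discarded / len(curr)


def mean_discard(per_frame: Sequence[float]) -> float:
    """Average the per-frame percentages over the length of the animation."""
    return sum(per_frame) / len(per_frame)
```

The same structure applies to any feature buffer that can be compared with a distance,
such as world positions, normals, or radiance values.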


We render the datasets' animations with Blender's path tracer Cycles and extract the
feature buffers, namely world-space positions and normals and direct and indirect
radiance, separately [139]. Cycles is a path tracing renderer with a plethora of supported
features. Notably, the ability to render skeletal armatures was one key reason to select
it as the renderer for the comparisons. Other renderers designed for research purposes,
like PBRT or Mitsuba 2, lack this feature [21][114]. All of the feature buffers are
rendered at a resolution of 1920 x 1080 pixels, and for the indirect buffer, the pixels
are sampled 1024 times with 12 next event estimation light bounces. We selected these
configurations to have a reasonable rendering time and a sample count high enough for the
indirect buffer to converge so that the noise is mostly mitigated. Rendering at a higher
resolution with a bigger sample count and a longer sample path would result in
higher-precision discard percentages, but the selected rendering configuration is
sufficient for comparing the datasets.


For each inspected feature buffer, we compute the discard percentage for each scene with
the corresponding discard functions presented in Chapter 3. Finally, we sum the per-frame
percentages and show their mean in Table 6.7.


ETERNALVALLEYFPS has the biggest percentage of discarded pixels due to frustum
disocclusion, with ETERNALVALLEYVR coming right behind. Both datasets have a much larger
frustum discard percentage than the ORCA sets. This lines up with the previously
recognized change in the motion of the camera. The same trend continues with the world
position, shading normal, and direct radiance discards, ETERNALVALLEYFPS having the
highest discard rate and ETERNALVALLEYVR the second highest. The dataset EMERALDSQUARE
does have quite a sizeable discard percentage with world position and shading normal
compared to the other ORCA sets. This is most likely explained by the amount of vegetation
in the scene, as it is filled with bushes and trees with individual leaves.


Table 6.7. The percentage of discarded pixels averaged over the length of the animation.

                     frustum %   world position %   shading normal %   direct radiance %   indirect radiance %
Eternal Valley FPS   29.085      33.521             47.671             8.583               6.403
Eternal Valley VR    19.255      29.956             23.455             5.824               8.077
Toasters             0.0         0.214              1.073              0.973               0.432
Bistro Interior      0.140       2.156              0.962              2.769               0.718
Bistro Exterior      1.192       4.502              2.207              0.706               0.401
Emerald Square       1.781       11.416             17.461             4.983               0.432


The proposed datasets also show higher discard due to changing lighting conditions:
ETERNALVALLEYVR has the greatest change in the indirect radiance of all the scenes.
ETERNALVALLEYFPS, on the other hand, has the most change in direct lighting conditions.
BISTROINTERIOR and EMERALDSQUARE do have quite a bit of change in their direct radiance,
but they are still lower than the proposed sets.


7 CONCLUSION


In this thesis, we sought to determine the current status of temporal rendering and of the
benchmarks available to challenge the methods, and to produce applicable dynamic
benchmarks. There are no publicly available datasets that can bring forward the issues
temporal reuse methods have. The issues include ghosting, blurring, and aliasing in
rasterization, judder in VR, and light update lag in real-time ray and path tracing.


This thesis produces two datasets, ETERNALVALLEYVR and ETERNALVALLEYFPS, that provide an
excellent basis for benchmarking temporal rendering. While the proposed scenes have fewer
surface faces than most of the compared ORCA datasets, they have a higher number of
dynamically moving objects and are the only sets with skeletal animations, as shown in
Chapter 6.1. In Chapter 6.2, we show that the produced datasets exceed the previously
released benchmarks in their camera's motion, both in the camera's positional change and
in its rotation. The camera's animated transform determines how challenging the temporal
reuse cases are. The amount of challenge is confirmed even further by the discard
percentages shown in Table 6.7, where we show that the features used as input to most of
the temporal reuse algorithms change more rapidly in our animations. In summary, we
confirm that our datasets have the most considerable potential of being used as benchmarks
for temporal reuse algorithms.


In addition to the two datasets, we introduced a framework that can capture the anima- 
tions out of an interactive system, and we verified that it works by producing our datasets. 


Interesting future work would be to generalize the framework further and use it in different
interactive systems. The comparison metrics could also be extended to cover emissive
triangles in more depth, as they are commonplace in path tracing settings. A
straightforward extension would also be to separate the direct and indirect radiance into
diffuse, glossy, transmissive, and volumetric passes. This division would make it possible
to find a benchmark that excels at challenging temporal methods in cases where the data
arrives indirectly through reflections and more robust motion vectors must be used.
A APPENDIX


A.1 Pseudocode of linear blend skinning of vertices in a Vulkan vertex shader


#version 450

#define MAX_JOINTS 64

// Per-vertex attributes: position, normal, and up to four joint influences.
layout (location = 0) in vec3 inPosition;
layout (location = 1) in vec3 inNormal;
layout (location = 2) in vec4 jointIndices;
layout (location = 3) in vec4 jointWeights;

// Camera and scene transforms.
layout (set = 0, binding = 0) uniform UniformBufferObject {
    mat4 projection;
    mat4 model;
    mat4 view;
    vec3 cameraPosition;
} ubo;

// Per-model transform and the joint matrices of its armature.
layout (set = 1, binding = 0) uniform UBOModel {
    mat4 matrix;
    mat4 jointMatrices[MAX_JOINTS];
} armature;

layout (location = 0) out vec3 outWpos;
layout (location = 1) out vec3 outNormal;

out gl_PerVertex {
    vec4 gl_Position;
};

void main()
{
    // Blend the four influencing joint matrices by their weights.
    mat4 skinMat =
        jointWeights.x * armature.jointMatrices[int(jointIndices.x)] +
        jointWeights.y * armature.jointMatrices[int(jointIndices.y)] +
        jointWeights.z * armature.jointMatrices[int(jointIndices.z)] +
        jointWeights.w * armature.jointMatrices[int(jointIndices.w)];

    // Skinned vertex position in world space.
    vec4 position =
        ubo.model * armature.matrix * skinMat * vec4(inPosition, 1.0);

    // Normals are transformed with the inverse transpose of the same matrix.
    outNormal =
        normalize(transpose(inverse(
            mat3(ubo.model * armature.matrix * skinMat)))
            * inNormal);

    outWpos = position.xyz / position.w;

    gl_Position = ubo.projection * ubo.view * vec4(outWpos, 1.0);
}
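
The blend performed by the shader can be cross-checked on the CPU. The following is a minimal Python sketch, not part of the capturing framework: the helper names (mat_vec, blend_matrices, skin_position, translation) are hypothetical, matrices are assumed to be row-major nested lists, and the model and armature matrices of the shader are left out (treated as identity) to keep the example short.

```python
# CPU reference for the joint-matrix blend of A.1 (illustrative sketch only).
# Assumption: 4x4 matrices are row-major nested lists; helper names are hypothetical.

def mat_vec(m, v):
    """Multiply a 4x4 matrix by a 4-component column vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def blend_matrices(joint_matrices, indices, weights):
    """Weighted sum of the joint matrices selected by indices (skinMat in A.1)."""
    skin = [[0.0] * 4 for _ in range(4)]
    for joint, weight in zip(indices, weights):
        for r in range(4):
            for c in range(4):
                skin[r][c] += weight * joint_matrices[joint][r][c]
    return skin

def skin_position(position, joint_matrices, indices, weights):
    """Transform a 3D position by the blended matrix and divide by w."""
    skin = blend_matrices(joint_matrices, indices, weights)
    x, y, z, w = mat_vec(skin, [position[0], position[1], position[2], 1.0])
    return [x / w, y / w, z / w]

def translation(tx, ty, tz):
    """Build a translation matrix, used here only as test data."""
    return [[1.0, 0.0, 0.0, tx],
            [0.0, 1.0, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]

# A vertex influenced equally by a static joint and a joint moved 2 units along x.
joints = [translation(0.0, 0.0, 0.0), translation(2.0, 0.0, 0.0)]
skinned = skin_position([1.0, 0.0, 0.0], joints, [0, 1, 0, 0], [0.5, 0.5, 0.0, 0.0])
print(skinned)
```

With equal weights on the two joints, the vertex follows half of the second joint's translation, which is the behaviour expected from linear blending. The weights are assumed to sum to one, as they must for the homogeneous divide to leave the position unscaled.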
A.2 Datasets’ camera rotation animation in Euler angles

Figure A.1. Camera rotation per frame about each of its axes

A.3 Datasets’ camera rotation animation distance

Figure A.2. Angular distance the camera rotates per frame, in degrees
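
One common way to obtain a per-frame rotation distance in degrees is to take the angle of the relative rotation between consecutive camera orientations. The Python sketch below assumes the orientations are available as unit quaternions in (w, x, y, z) order. This is an assumption for illustration, and the definition used for the measurements in Chapter 6.2 may differ. The function names are hypothetical.

```python
import math

def rotation_angle_degrees(q_prev, q_curr):
    """Angle in degrees of the rotation between two unit quaternions (w, x, y, z)."""
    dot = sum(a * b for a, b in zip(q_prev, q_curr))
    # q and -q describe the same orientation, so use the absolute value;
    # clamp to 1.0 to guard acos against floating-point rounding.
    dot = min(1.0, abs(dot))
    return math.degrees(2.0 * math.acos(dot))

def yaw_quaternion(angle_degrees):
    """Unit quaternion for a rotation about the y axis, used here as test data."""
    half = math.radians(angle_degrees) / 2.0
    return (math.cos(half), 0.0, math.sin(half), 0.0)

# Three frames of a camera yawing by 5 and then by 10 degrees.
frames = [yaw_quaternion(angle) for angle in (0.0, 5.0, 15.0)]
per_frame = [rotation_angle_degrees(a, b) for a, b in zip(frames, frames[1:])]
print(per_frame)
```

Unlike the per-axis Euler angles of Figure A.1, this single angle does not depend on the chosen axis order, which makes it convenient for comparing animations.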


